From patchwork Tue Aug 28 17:22:50 2018
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578883
From: Johannes Weiner <hannes@cmpxchg.org>
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
    Christopher Lameter, Peter Enderborg, Shakeel Butt, Mike Galbraith,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 1/9] mm: workingset: don't drop refault information prematurely
Date: Tue, 28 Aug 2018 13:22:50 -0400
Message-Id: <20180828172258.3185-2-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org>
References: <20180828172258.3185-1-hannes@cmpxchg.org>
If we keep just enough refault information to match the *current*
page cache during reclaim time, we could lose a lot of events when
there is only a temporary spike in non-cache memory consumption that
pushes out all the cache. Once cache comes back, we won't see those
refaults. They might not be actionable for LRU aging, but we want to
know about them for measuring memory pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..53759a3cf99a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -364,7 +364,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 {
 	unsigned long max_nodes;
 	unsigned long nodes;
-	unsigned long cache;
+	unsigned long pages;
 
 	/* list_lru lock nests inside the IRQ-safe i_pages lock */
 	local_irq_disable();
@@ -393,14 +393,14 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	 *
 	 * PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
 	 */
-	if (sc->memcg) {
-		cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
-						     LRU_ALL_FILE);
-	} else {
-		cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
-			node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
-	}
-	max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);
+#ifdef CONFIG_MEMCG
+	if (sc->memcg)
+		pages = page_counter_read(&sc->memcg->memory);
+	else
+#endif
+		pages = node_present_pages(sc->nid);
+
+	max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);
 
 	if (nodes <= max_nodes)
 		return 0;
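To make the new sizing rule concrete, here is a minimal userspace
sketch of the cap arithmetic above (not part of the patch). The
RADIX_TREE_MAP_SHIFT value of 6, i.e. 64 slots per radix tree node,
is an assumption matching common kernel configs, and the sample
node size is invented:

    #include <stdio.h>

    #define RADIX_TREE_MAP_SHIFT 6  /* assumed: 64 slots per node */

    /*
     * Cap from count_shadow_nodes(): allow roughly one shadow-entry
     * radix tree node per 8 pages of memory, sized against total
     * memory (or the memcg limit) instead of the current amount of
     * page cache.
     */
    static unsigned long max_shadow_nodes(unsigned long pages)
    {
            return pages >> (RADIX_TREE_MAP_SHIFT - 3);
    }

    int main(void)
    {
            unsigned long pages = 1UL << 20; /* 4G worth of 4k pages */

            printf("max shadow nodes: %lu\n", max_shadow_nodes(pages));
            return 0;
    }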
From patchwork Tue Aug 28 17:22:51 2018
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578885
From: Johannes Weiner <hannes@cmpxchg.org>
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
    Christopher Lameter, Peter Enderborg, Shakeel Butt, Mike Galbraith,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 2/9] mm: workingset: tell cache transitions from workingset thrashing
Date: Tue, 28 Aug 2018 13:22:51 -0400
Message-Id: <20180828172258.3185-3-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org>
References: <20180828172258.3185-1-hannes@cmpxchg.org>

Refaults happen during transitions between workingsets as well as
in-place thrashing. Knowing the difference between the two has a
range of applications, including measuring the impact of memory
shortage on system performance, as well as the ability to balance
pressure more intelligently between the filesystem cache and the
swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes
out established active cache. When that active cache isn't stale,
however, and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page
has been active in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags
	21 if you have an MMU
	23 with the zone bits for DMA, Normal, HighMem, Movable
	29 with the sparsemem section bits
	30 if PAE is enabled
	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
nodes. If that's not enough, the system can switch to discontigmem
and re-gain the 6 or 7 sparsemem section bits.
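To illustrate how the flag rides along in the shadow entry, here is
a small standalone sketch of the pack/unpack step (not part of the
patch); it models only the new low-order workingset bit and leaves
out the memcg id, node id, and exceptional-entry bits of the real
pack_shadow()/unpack_shadow():

    #include <stdbool.h>
    #include <stdio.h>

    /* eviction = (eviction << 1) | workingset, as in pack_shadow() */
    static unsigned long pack(unsigned long eviction, bool workingset)
    {
            return (eviction << 1) | workingset;
    }

    static unsigned long unpack(unsigned long entry, bool *workingset)
    {
            *workingset = entry & 1;
            return entry >> 1;
    }

    int main(void)
    {
            bool ws;
            unsigned long eviction = unpack(pack(12345, true), &ws);

            printf("eviction=%lu workingset=%d\n", eviction, ws);
            return 0;
    }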
v4:
 - fix a typo in the comments, as per Suren

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h         |  1 +
 include/linux/page-flags.h     |  5 +-
 include/linux/swap.h           |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   |  9 ++--
 mm/huge_memory.c               |  1 +
 mm/memcontrol.c                |  2 +
 mm/migrate.c                   |  2 +
 mm/swap_state.c                |  1 +
 mm/vmscan.c                    |  1 +
 mm/vmstat.c                    |  1 +
 mm/workingset.c                | 95 ++++++++++++++++++++++------------
 12 files changed, 79 insertions(+), 42 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..6af87946d241 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,7 @@ enum node_stat_item {
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 901943e4754b..79346bc1da7a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -69,13 +69,14 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
-	PG_error,
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_workingset,
 	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
+	PG_error,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
@@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c063443d8638..d8822365782b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,7 +296,7 @@ struct vma_swap_readahead {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
 /* Do not use directly, use workingset_lookup_update */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..a1675d43777e 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -88,6 +88,7 @@
 	{1UL << PG_dirty,		"dirty"		},		\
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
+	{1UL << PG_workingset,		"workingset"	},		\
 	{1UL << PG_slab,		"slab"		},		\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..5e53424d9097 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -915,12 +915,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * data from the working set, only to cache data that will
 		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (!(gfp_mask & __GFP_WRITE) &&
-		    shadow && workingset_refault(shadow)) {
-			SetPageActive(page);
-			workingset_activation(page);
-		} else
-			ClearPageActive(page);
+		WARN_ON_ONCE(PageActive(page));
+		if (!(gfp_mask & __GFP_WRITE) && shadow)
+			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
 	return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 25346bd99364..04d663c58bbe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2369,6 +2369,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
+			 (1L << PG_workingset) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2173f7e5164..84824b775470 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5329,6 +5329,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 		   stat[WORKINGSET_REFAULT]);
 	seq_printf(m, "workingset_activate %lu\n",
 		   stat[WORKINGSET_ACTIVATE]);
+	seq_printf(m, "workingset_restore %lu\n",
+		   stat[WORKINGSET_RESTORE]);
 	seq_printf(m, "workingset_nodereclaim %lu\n",
 		   stat[WORKINGSET_NODERECLAIM]);
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..a6a9114e62dc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -682,6 +682,8 @@ void migrate_page_states(struct page *newpage, struct page *page)
 		SetPageActive(newpage);
 	} else if (TestClearPageUnevictable(page))
 		SetPageUnevictable(newpage);
+	if (PageWorkingset(page))
+		SetPageWorkingset(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ecee9c6c4cc1..0d6a7f268d2e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -448,6 +448,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/*
 		 * Initiate read into locked page and return.
 		 */
+		SetPageWorkingset(new_page);
 		lru_cache_add_anon(new_page);
 		*new_page_allocated = true;
 		return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f86f288..7fdbc18fea6f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1976,6 +1976,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ba0870ecddd..28f2faad95d4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1145,6 +1145,7 @@ const char * const vmstat_text[] = {
 	"nr_isolated_file",
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index 53759a3cf99a..f1bbce55ea60 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -121,7 +121,7 @@
  * the only thing eating into inactive list space is active pages.
  *
  *
- *		Activating refaulting pages
+ *		Refaulting inactive pages
  *
  * All that is known about the active list is that the pages have been
  * accessed more than once in the past.  This means that at any given
@@ -134,6 +134,10 @@
  * used less frequently than the refaulting page - or even not used at
  * all anymore.
  *
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
  * If this is wrong and demotion kicks in, the pages which are truly
  * used more frequently will be reactivated while the less frequently
 * used once will be evicted from memory.
@@ -141,6 +145,14 @@
 * But if this is right, the stale pages will be pushed out of memory
 * and the used pages get to stay in cache.
 *
+ *		Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
 *
 *		Implementation
 *
@@ -156,8 +168,7 @@
 */
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 NODES_SHIFT +	\
-			 MEM_CGROUP_ID_SHIFT)
+			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -170,23 +181,28 @@
 */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+			 bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << 1) | workingset;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	bool workingset;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	workingset = entry & 1;
+	entry >>= 1;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +211,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*workingsetp = workingset;
 }
 
 /**
@@ -207,8 +224,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
 	struct pglist_data *pgdat = page_pgdat(page);
+	struct mem_cgroup *memcg = page_memcg(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -220,30 +237,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, pgdat, eviction);
+	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
 /**
  * workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
  * evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
  */
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
 {
 	unsigned long refault_distance;
+	struct pglist_data *pgdat;
 	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct pglist_data *pgdat;
+	bool workingset;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
@@ -263,41 +280,51 @@ bool workingset_refault(void *shadow)
 	 * configurations instead.
 	 */
 	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg) {
-		rcu_read_unlock();
-		return false;
-	}
+	if (!mem_cgroup_disabled() && !memcg)
+		goto out;
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
 
 	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
+	 * Calculate the refault distance
 	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases.  There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * inactive_age to lap a shadow entry in the field, which can
+	 * then result in a false small refault distance, leading to a
+	 * false activation should this old entry actually refault
+	 * again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
	 */
	refault_distance = (refault - eviction) & EVICTION_MASK;

	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);

-	if (refault_distance <= active_file) {
-		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
-		rcu_read_unlock();
-		return true;
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't act on pages that couldn't stay resident even if all
+	 * the memory was available to the page cache.
+	 */
+	if (refault_distance > active_file)
+		goto out;
+
+	SetPageActive(page);
+	atomic_long_inc(&lruvec->inactive_age);
+	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+
+	/* Page was active prior to eviction */
+	if (workingset) {
+		SetPageWorkingset(page);
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+	}
+out:
 	rcu_read_unlock();
-	return false;
 }
 
 /**
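The resulting refault classification can be summed up in a few
lines; the following distilled sketch (illustrative names, kernel
locking and statistics omitted) mirrors the decision made at the
end of workingset_refault():

    #include <stdbool.h>

    enum refault_kind {
            REFAULT_ONLY,   /* counted, but not activated */
            TRANSITION,     /* inactive cache refaulting: new workingset */
            RESTORE,        /* active cache refaulting: thrashing */
    };

    static enum refault_kind classify(unsigned long refault_distance,
                                      unsigned long active_file,
                                      bool was_active)
    {
            /* Couldn't stay resident even with all memory as cache */
            if (refault_distance > active_file)
                    return REFAULT_ONLY;
            return was_active ? RESTORE : TRANSITION;
    }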
From patchwork Tue Aug 28 17:22:52 2018
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578891
From: Johannes Weiner <hannes@cmpxchg.org>
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
    Christopher Lameter, Peter Enderborg, Shakeel Butt, Mike Galbraith,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 3/9] delayacct: track delays from thrashing cache pages
Date: Tue, 28 Aug 2018 13:22:52 -0400
Message-Id: <20180828172258.3185-4-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org>
References: <20180828172258.3185-1-hannes@cmpxchg.org>

Delay accounting already measures the time a task spends in direct
reclaim and waiting for swapin, but in low memory situations tasks
can spend a significant amount of their time waiting on thrashing
page cache. This isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache
page to be read back into memory.

Also update tools/accounting/getdelays.c:

    [hannes@computer accounting]$ sudo ./getdelays -d -p 1
    print delayacct stats ON
    PID     1

    CPU             count     real total  virtual total    delay total  delay average
                    50318      745000000      847346785      400533713          0.008ms
    IO              count    delay total  delay average
                      435      122601218              0ms
    SWAP            count    delay total  delay average
                        0              0              0ms
    RECLAIM         count    delay total  delay average
                        0              0              0ms
    THRASHING       count    delay total  delay average
                       19       12621439              0ms

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/delayacct.h      | 23 +++++++++++++++++++++++
 include/uapi/linux/taskstats.h |  6 +++++-
 kernel/delayacct.c             | 15 +++++++++++++++
 mm/filemap.c                   | 11 +++++++++++
 tools/accounting/getdelays.c   |  8 +++++++-
 5 files changed, 61 insertions(+), 2 deletions(-)
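The accounting in the hunks below follows the usual delayacct
pattern: take a timestamp before the wait and fold the elapsed time
into a per-task total afterwards. A minimal userspace analogue of
that pattern (not the kernel code; clock_gettime() stands in for
ktime_get_ns(), and locking is omitted):

    #include <stdint.h>
    #include <time.h>

    struct delays {
            uint64_t thrashing_start; /* ns timestamp at wait start */
            uint64_t thrashing_delay; /* accumulated wait time, ns */
            uint32_t thrashing_count; /* number of waits */
    };

    static uint64_t now_ns(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* analogue of __delayacct_thrashing_start() */
    static void thrashing_start(struct delays *d)
    {
            d->thrashing_start = now_ns();
    }

    /* analogue of __delayacct_thrashing_end() */
    static void thrashing_end(struct delays *d)
    {
            d->thrashing_delay += now_ns() - d->thrashing_start;
            d->thrashing_count++;
    }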
diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 31c865d1842e..577d1b25fccd 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -57,7 +57,12 @@ struct task_delay_info {
 
 	u64 freepages_start;
 	u64 freepages_delay;	/* wait for memory reclaim */
+
+	u64 thrashing_start;
+	u64 thrashing_delay;	/* wait for thrashing page */
+
 	u32 freepages_count;	/* total count of memory reclaim */
+	u32 thrashing_count;	/* total count of thrash waits */
 };
 #endif
 
@@ -76,6 +81,8 @@ extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
 extern void __delayacct_freepages_end(void);
+extern void __delayacct_thrashing_start(void);
+extern void __delayacct_thrashing_end(void);
 
 static inline int delayacct_is_task_waiting_on_io(struct task_struct *p)
 {
@@ -156,6 +163,18 @@ static inline void delayacct_freepages_end(void)
 		__delayacct_freepages_end();
 }
 
+static inline void delayacct_thrashing_start(void)
+{
+	if (current->delays)
+		__delayacct_thrashing_start();
+}
+
+static inline void delayacct_thrashing_end(void)
+{
+	if (current->delays)
+		__delayacct_thrashing_end();
+}
+
 #else
 static inline void delayacct_set_flag(int flag)
 {}
@@ -182,6 +201,10 @@ static inline void delayacct_freepages_start(void)
 {}
 static inline void delayacct_freepages_end(void)
 {}
+static inline void delayacct_thrashing_start(void)
+{}
+static inline void delayacct_thrashing_end(void)
+{}
 
 #endif /* CONFIG_TASK_DELAY_ACCT */
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index b7aa7bb2349f..5e8ca16a9079 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
  */
 
-#define TASKSTATS_VERSION	8
+#define TASKSTATS_VERSION	9
 #define TS_COMM_LEN		32	/* should be >= TASK_COMM_LEN
					 * in linux/sched.h */
 
@@ -164,6 +164,10 @@ struct taskstats {
 	/* Delay waiting for memory reclaim */
 	__u64	freepages_count;
 	__u64	freepages_delay_total;
+
+	/* Delay waiting for thrashing page */
+	__u64	thrashing_count;
+	__u64	thrashing_delay_total;
 };
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index ca8ac2824f0b..2a12b988c717 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -135,9 +135,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
 	d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
 	tmp = d->freepages_delay_total + tsk->delays->freepages_delay;
 	d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp;
+	tmp = d->thrashing_delay_total + tsk->delays->thrashing_delay;
+	d->thrashing_delay_total = (tmp < d->thrashing_delay_total) ? 0 : tmp;
 	d->blkio_count += tsk->delays->blkio_count;
 	d->swapin_count += tsk->delays->swapin_count;
 	d->freepages_count += tsk->delays->freepages_count;
+	d->thrashing_count += tsk->delays->thrashing_count;
 	raw_spin_unlock_irqrestore(&tsk->delays->lock, flags);
 
 	return 0;
@@ -169,3 +172,15 @@ void __delayacct_freepages_end(void)
 		      &current->delays->freepages_count);
 }
 
+void __delayacct_thrashing_start(void)
+{
+	current->delays->thrashing_start = ktime_get_ns();
+}
+
+void __delayacct_thrashing_end(void)
+{
+	delayacct_end(&current->delays->lock,
+		      &current->delays->thrashing_start,
+		      &current->delays->thrashing_delay,
+		      &current->delays->thrashing_count);
+}
diff --git a/mm/filemap.c b/mm/filemap.c
index 5e53424d9097..ca895ebe43ac 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <linux/delayacct.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1073,8 +1074,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
+	bool thrashing = false;
 	int ret = 0;
 
+	if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+	    !PageUptodate(page) && PageWorkingset(page)) {
+		delayacct_thrashing_start();
+		thrashing = true;
+	}
+
 	init_wait(wait);
 	wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
 	wait->func = wake_page_function;
@@ -1113,6 +1121,9 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 	finish_wait(q, wait);
 
+	if (thrashing)
+		delayacct_thrashing_end();
+
 	/*
	 * A signal could leave PageWaiters set. Clearing it here if
	 * !waitqueue_active would be possible (by open-coding finish_wait),
diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c
index 9f420d98b5fb..8cb504d30384 100644
--- a/tools/accounting/getdelays.c
+++ b/tools/accounting/getdelays.c
@@ -203,6 +203,8 @@ static void print_delayacct(struct taskstats *t)
 	       "SWAP  %15s%15s%15s\n"
 	       "      %15llu%15llu%15llums\n"
 	       "RECLAIM  %12s%15s%15s\n"
+	       "      %15llu%15llu%15llums\n"
+	       "THRASHING%12s%15s%15s\n"
 	       "      %15llu%15llu%15llums\n",
 	       "count", "real total", "virtual total",
 	       "delay total", "delay average",
@@ -222,7 +224,11 @@ static void print_delayacct(struct taskstats *t)
 	       "count", "delay total", "delay average",
 	       (unsigned long long)t->freepages_count,
 	       (unsigned long long)t->freepages_delay_total,
-	       average_ms(t->freepages_delay_total, t->freepages_count));
+	       average_ms(t->freepages_delay_total, t->freepages_count),
+	       "count", "delay total", "delay average",
+	       (unsigned long long)t->thrashing_count,
+	       (unsigned long long)t->thrashing_delay_total,
+	       average_ms(t->thrashing_delay_total, t->thrashing_count));
 }
 
 static void task_context_switch_counts(struct taskstats *t)
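A note on the sample output above: the THRASHING row shows 0ms
despite 19 recorded waits because the averages are printed in whole
milliseconds. A quick standalone check of the arithmetic (the
rounding behavior is the point here, not getdelays.c's exact macro):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* numbers from the THRASHING row of the sample output */
            uint64_t delay_total_ns = 12621439;
            uint64_t count = 19;

            /* integer milliseconds, as the tool prints them... */
            printf("%llums\n", (unsigned long long)
                   (delay_total_ns / count / 1000000));

            /* ...and with fractional precision: ~0.664ms per wait */
            printf("%.3fms\n", delay_total_ns / (double)count / 1e6);
            return 0;
    }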
From patchwork Tue Aug 28 17:22:53 2018
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578889
From: Johannes Weiner <hannes@cmpxchg.org>
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
    Christopher Lameter, Peter Enderborg, Shakeel Butt, Mike Galbraith,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 4/9] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
Date: Tue, 28 Aug 2018 13:22:53 -0400
Message-Id: <20180828172258.3185-5-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org>
References: <20180828172258.3185-1-hannes@cmpxchg.org>

There are several duplicate definitions of these functions/macros in
places that mess with fixed-point load averages. Provide one
official version.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 .../platforms/cell/cpufreq_spudemand.c        |  2 +-
 arch/powerpc/platforms/cell/spufs/sched.c     |  9 +++-----
 arch/s390/appldata/appldata_os.c              |  4 ----
 drivers/cpuidle/governors/menu.c              |  4 ----
 fs/proc/loadavg.c                             |  3 ---
 include/linux/sched/loadavg.h                 | 21 +++++++++++++++----
 kernel/debug/kdb/kdb_main.c                   |  7 +------
 kernel/sched/loadavg.c                        | 15 -------------
 8 files changed, 22 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/platforms/cell/cpufreq_spudemand.c b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
index 882944c36ef5..5d8e8b6bb1cc 100644
--- a/arch/powerpc/platforms/cell/cpufreq_spudemand.c
+++ b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
@@ -49,7 +49,7 @@ static int calc_freq(struct spu_gov_info_struct *info)
 	cpu = info->policy->cpu;
 	busy_spus = atomic_read(&cbe_spu_info[cpu_to_node(cpu)].busy_spus);
 
-	CALC_LOAD(info->busy_spus, EXP, busy_spus * FIXED_1);
+	info->busy_spus = calc_load(info->busy_spus, EXP, busy_spus * FIXED_1);
 	pr_debug("cpu %d: busy_spus=%d, info->busy_spus=%ld\n",
 		 cpu, busy_spus, info->busy_spus);
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index c9ef3c532169..9fcccb4490b9 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -987,9 +987,9 @@ static void spu_calc_load(void)
 	unsigned long active_tasks; /* fixed-point */
 
 	active_tasks = count_active_contexts() * FIXED_1;
-	CALC_LOAD(spu_avenrun[0], EXP_1, active_tasks);
-	CALC_LOAD(spu_avenrun[1], EXP_5, active_tasks);
-	CALC_LOAD(spu_avenrun[2], EXP_15, active_tasks);
+	spu_avenrun[0] = calc_load(spu_avenrun[0], EXP_1, active_tasks);
+	spu_avenrun[1] = calc_load(spu_avenrun[1], EXP_5, active_tasks);
+	spu_avenrun[2] = calc_load(spu_avenrun[2], EXP_15, active_tasks);
 }
 
 static void spusched_wake(struct timer_list *unused)
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
 	}
 }
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int show_spu_loadavg(struct seq_file *s, void *private)
 {
 	int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 433a994b1a89..54f375627532 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -25,10 +25,6 @@
 
 #include "appldata.h"
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 /*
  * OS data
  *
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 1aef60d160eb..e508d08b7ccb 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -131,10 +131,6 @@ struct menu_device {
 	int		interval_ptr;
 };
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static inline int get_loadavg(unsigned long load)
 {
 	return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index d06694757201..8468baee951d 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -10,9 +10,6 @@
 #include <...>
 #include <...>
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
 	unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 80bc84ba5d2a..cc9cc62bb1f8 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -22,10 +22,23 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
 #define EXP_5		2014		/* 1/exp(5sec/5min) */
 #define EXP_15		2037		/* 1/exp(5sec/15min) */
 
-#define CALC_LOAD(load,exp,n) \
-	load *= exp; \
-	load += n*(FIXED_1-exp); \
-	load >>= FSHIFT;
+/*
+ * a1 = a0 * e + a * (1 - e)
+ */
+static inline unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+	unsigned long newload;
+
+	newload = load * exp + active * (FIXED_1 - exp);
+	if (active >= load)
+		newload += FIXED_1-1;
+
+	return newload / FIXED_1;
+}
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 
 extern void calc_global_load(unsigned long ticks);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 2ddfce8f1e8f..bb4fe4e1a601 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2556,16 +2556,11 @@ static int kdb_summary(int argc, const char **argv)
 	}
 	kdb_printf("%02ld:%02ld\n", val.uptime/(60*60), (val.uptime/60)%60);
 
-	/* lifted from fs/proc/proc_misc.c::loadavg_read_proc() */
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 	kdb_printf("load avg   %ld.%02ld %ld.%02ld %ld.%02ld\n",
 		   LOAD_INT(val.loads[0]), LOAD_FRAC(val.loads[0]),
 		   LOAD_INT(val.loads[1]), LOAD_FRAC(val.loads[1]),
 		   LOAD_INT(val.loads[2]), LOAD_FRAC(val.loads[2]));
-#undef LOAD_INT
-#undef LOAD_FRAC
+
 	/* Display in kilobytes */
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	kdb_printf("\nMemTotal:       %8lu kB\nMemFree:        %8lu kB\n"
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a171c1258109..54fbdfb2d86c 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -91,21 +91,6 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	return delta;
 }
 
-/*
- * a1 = a0 * e + a * (1 - e)
- */
-static unsigned long
-calc_load(unsigned long load, unsigned long exp, unsigned long active)
-{
-	unsigned long newload;
-
-	newload = load * exp + active * (FIXED_1 - exp);
-	if (active >= load)
-		newload += FIXED_1-1;
-
-	return newload / FIXED_1;
-}
-
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * Handle NO_HZ for the global load-average.
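For readers unfamiliar with the fixed-point scheme, here is a small
standalone sketch of the now-official calc_load() in action (not
part of the patch); the FSHIFT/FIXED_1/EXP_1 constants are copied
from the kernel's loadavg.h, and the sample input is invented:

    #include <stdio.h>

    #define FSHIFT  11                      /* bits of precision */
    #define FIXED_1 (1 << FSHIFT)           /* 1.0 in fixed-point */
    #define EXP_1   1884                    /* 1/exp(5sec/1min) */

    #define LOAD_INT(x)  ((x) >> FSHIFT)
    #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

    /* a1 = a0 * e + a * (1 - e), rounded up while load is rising */
    static unsigned long calc_load(unsigned long load, unsigned long exp,
                                   unsigned long active)
    {
            unsigned long newload = load * exp + active * (FIXED_1 - exp);

            if (active >= load)
                    newload += FIXED_1 - 1;
            return newload / FIXED_1;
    }

    int main(void)
    {
            unsigned long avenrun = 0;
            int i;

            /* ten 5-second ticks with two runnable tasks */
            for (i = 0; i < 10; i++)
                    avenrun = calc_load(avenrun, EXP_1, 2 * FIXED_1);

            printf("1-min load: %lu.%02lu\n",
                   LOAD_INT(avenrun), LOAD_FRAC(avenrun));
            return 0;
    }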
From patchwork Tue Aug 28 17:22:54 2018
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578893
K5uyJ3DhJ+ZVHdwLZpGR1ca1PJj73F2ujDhBBPtNF5gVy+AvF8ydlprhGuJp+DuXNI9R vZsMxg6WIPZOBtKNWDTHyy+52ZTSkMeTd4WvnWJJ6CfsDp8xpq21nPE/q2etVYNdMwF4 E5nA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature:arc-authentication-results; bh=7HJ/faUwNp70pvtX2vst2/bX0oNS92n5Rlox7XiLwSM=; b=J+Jo6clhhByXss+CbMgMYBmtHcA+BEheBOkr/45pDHitT7Jwmc1biqYVFpgPC5Z8sB uZHTJlS0tufrm7RG7d3DmViOgCz1O2/dDAP9SYkJ+PF49YTPti6wOYibLKxRjUMMSfqO 4pLHQtE6EEEqd/obHHl7ZCqHOHCLTiLGkdXAgxdRp4X0Xah/+Q/S6kuuFiYg5loKbOZX MMdF72zwNlWcpYq38zv4ElVASsHQSrgY7XAgYH1Cj1UJfAmeHu1AbIP5GPWM11k/WjZJ YZsj4my4VO9MYL8GvdmcJLPc+HDm5rD45rZS9Arb1m4dHQ9xKAFrAPTqfS7K/Ui4Dj8q pJJg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cmpxchg-org.20150623.gappssmtp.com header.s=20150623 header.b=jIhStVWD; spf=pass (google.com: domain of hannes@cmpxchg.org designates 209.85.220.65 as permitted sender) smtp.mailfrom=hannes@cmpxchg.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. [209.85.220.65]) by mx.google.com with SMTPS id 201-v6sor431923ybn.159.2018.08.28.10.23.29 for (Google Transport Security); Tue, 28 Aug 2018 10:23:29 -0700 (PDT) Received-SPF: pass (google.com: domain of hannes@cmpxchg.org designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; Authentication-Results: mx.google.com; dkim=pass header.i=@cmpxchg-org.20150623.gappssmtp.com header.s=20150623 header.b=jIhStVWD; spf=pass (google.com: domain of hannes@cmpxchg.org designates 209.85.220.65 as permitted sender) smtp.mailfrom=hannes@cmpxchg.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=7HJ/faUwNp70pvtX2vst2/bX0oNS92n5Rlox7XiLwSM=; b=jIhStVWDliq16204BH4yFa9JfJgW30tWLG33tq/hGkX5TwlOjEM92VNzEy1W/9O7pE cBjmnCwdKz8UyEdgbMFr76tZo60U94akOVfnGXPJwGhyJ2jAI6V5EDTlW3ue7cP8Yftu SO2cJUVsKNHMkuXOGhhs3rRs17g6cMePZcqzo3fCIh0X2kk147agMiJdTUrE/1sdntV7 tnRo+E4nB7P/z/MofV9CTqpmRHNbu2EVBa3T/hj8k/wUbpTq3DsTxnyZN+8+xiuxL+gf 97uIhkbqeig4moFP26DoLBm/yoFJkz7IdLsGW1BQoCbZHX5dSVYcG0/muFaHzH2Wi18f 3xhw== X-Google-Smtp-Source: ANB0VdaNGaFHS4gzWZuIso6V3BU/IJ/P6lbdlxSdFGuKMhZo33JAZOMAwn4+D5lt5EtqKGH3jOvo9Q== X-Received: by 2002:a25:6f84:: with SMTP id k126-v6mr1349981ybc.419.1535477009345; Tue, 28 Aug 2018 10:23:29 -0700 (PDT) Received: from localhost ([2620:10d:c091:200::1:de86]) by smtp.gmail.com with ESMTPSA id p126-v6sm609517ywf.22.2018.08.28.10.23.27 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 28 Aug 2018 10:23:28 -0700 (PDT) From: Johannes Weiner To: Ingo Molnar , Peter Zijlstra , Andrew Morton , Linus Torvalds Cc: Tejun Heo , Suren Baghdasaryan , Daniel Drake , Vinayak Menon , Christopher Lameter , Peter Enderborg , Shakeel Butt , Mike Galbraith , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 5/9] sched: loadavg: make calc_load_n() public Date: Tue, 28 Aug 2018 13:22:54 -0400 Message-Id: <20180828172258.3185-6-hannes@cmpxchg.org> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org> References: <20180828172258.3185-1-hannes@cmpxchg.org> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org 
It's going to be used in a later patch. Keep the churn separate.

Signed-off-by: Johannes Weiner
---
 include/linux/sched/loadavg.h |   3 +
 kernel/sched/loadavg.c        | 138 +++++++++++++++++-----------------
 2 files changed, 72 insertions(+), 69 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index cc9cc62bb1f8..4859bea47a7b 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -37,6 +37,9 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
 	return newload / FIXED_1;
 }
 
+extern unsigned long calc_load_n(unsigned long load, unsigned long exp,
+				 unsigned long active, unsigned int n);
+
 #define LOAD_INT(x) ((x) >> FSHIFT)
 #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 54fbdfb2d86c..28a516575c18 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -91,6 +91,75 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	return delta;
 }
 
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time
+ *
+ * @x:         base of the power
+ * @frac_bits: fractional bits of @x
+ * @n:         power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.
+ */
+static unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+	unsigned long result = 1UL << frac_bits;
+
+	if (n) {
+		for (;;) {
+			if (n & 1) {
+				result *= x;
+				result += 1UL << (frac_bits - 1);
+				result >>= frac_bits;
+			}
+			n >>= 1;
+			if (!n)
+				break;
+			x *= x;
+			x += 1UL << (frac_bits - 1);
+			x >>= frac_bits;
+		}
+	}
+
+	return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ *    = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ *  ...
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
+ *    = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ *    = a0 * e^n + a * (1 - e^n)
+ *
+ * [1] application of the geometric series:
+ *
+ *              n         1 - x^(n+1)
+ *     S_n := \Sum x^i = -------------
+ *             i=0          1 - x
+ */
+unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+	    unsigned long active, unsigned int n)
+{
+	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
+}
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * Handle NO_HZ for the global load-average.
@@ -210,75 +279,6 @@ static long calc_load_nohz_fold(void)
 	return delta;
 }
 
-/**
- * fixed_power_int - compute: x^n, in O(log n) time
- *
- * @x:         base of the power
- * @frac_bits: fractional bits of @x
- * @n:         power to raise @x to.
- *
- * By exploiting the relation between the definition of the natural power
- * function: x^n := x*x*...*x (x multiplied by itself for n times), and
- * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
- * (where: n_i \elem {0, 1}, the binary vector representing n),
- * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
- * of course trivially computable in O(log_2 n), the length of our binary
- * vector.
- */
-static unsigned long
-fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
-{
-	unsigned long result = 1UL << frac_bits;
-
-	if (n) {
-		for (;;) {
-			if (n & 1) {
-				result *= x;
-				result += 1UL << (frac_bits - 1);
-				result >>= frac_bits;
-			}
-			n >>= 1;
-			if (!n)
-				break;
-			x *= x;
-			x += 1UL << (frac_bits - 1);
-			x >>= frac_bits;
-		}
-	}
-
-	return result;
-}
-
-/*
- * a1 = a0 * e + a * (1 - e)
- *
- * a2 = a1 * e + a * (1 - e)
- *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
- *    = a0 * e^2 + a * (1 - e) * (1 + e)
- *
- * a3 = a2 * e + a * (1 - e)
- *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
- *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
- *
- *  ...
- *
- * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
- *    = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
- *    = a0 * e^n + a * (1 - e^n)
- *
- * [1] application of the geometric series:
- *
- *              n         1 - x^(n+1)
- *     S_n := \Sum x^i = -------------
- *             i=0          1 - x
- */
-static unsigned long
-calc_load_n(unsigned long load, unsigned long exp,
-	    unsigned long active, unsigned int n)
-{
-	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
-}
-
 /*
  * NO_HZ can leave us missing all per-CPU ticks calling
  * calc_load_fold_active(), but since a NO_HZ CPU folds its delta into
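[ Editorial sketch, not part of the series: a compact userspace check
  that one calc_load_n() call tracks n iterated calc_load() updates, up
  to fixed-point rounding. Constants follow loadavg.h's convention. ]

#include <stdio.h>

#define FSHIFT	11
#define FIXED_1	(1UL << FSHIFT)
#define EXP_1	1884	/* 1/exp(5sec/1min), same scheme as EXP_5/EXP_15 */

static unsigned long calc_load(unsigned long load, unsigned long exp,
			       unsigned long active)
{
	unsigned long newload = load * exp + active * (FIXED_1 - exp);

	if (active >= load)
		newload += FIXED_1 - 1;
	return newload / FIXED_1;
}

/* same algorithm as the kernel version above, restructured as a while loop */
static unsigned long fixed_power_int(unsigned long x, unsigned int frac_bits,
				     unsigned int n)
{
	unsigned long result = 1UL << frac_bits;

	while (n) {
		if (n & 1) {
			result *= x;
			result += 1UL << (frac_bits - 1);
			result >>= frac_bits;
		}
		n >>= 1;
		if (!n)
			break;
		x *= x;
		x += 1UL << (frac_bits - 1);
		x >>= frac_bits;
	}
	return result;
}

static unsigned long calc_load_n(unsigned long load, unsigned long exp,
				 unsigned long active, unsigned int n)
{
	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
}

int main(void)
{
	unsigned long iterated = 3 * FIXED_1;	/* decay a load of 3.0 */
	unsigned int n;

	/* the two columns agree to within a few fixed-point ulps */
	for (n = 1; n <= 24; n++) {
		iterated = calc_load(iterated, EXP_1, 0);
		printf("n=%2u iterated=%4lu calc_load_n=%4lu\n",
		       n, iterated, calc_load_n(3 * FIXED_1, EXP_1, 0, n));
	}
	return 0;
}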
From patchwork Tue Aug 28 17:22:55 2018
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578895
From: Johannes Weiner
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
    Christopher Lameter, Peter Enderborg, Shakeel Butt, Mike Galbraith,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 6/9] sched: sched.h: make rq locking and clock functions
 available in stats.h
Date: Tue, 28 Aug 2018 13:22:55 -0400
Message-Id: <20180828172258.3185-7-hannes@cmpxchg.org>
In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org>
References: <20180828172258.3185-1-hannes@cmpxchg.org>

kernel/sched/sched.h includes "stats.h" half-way through the file. The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h.

Move those definitions up in the file so they are available in stats.h.
Signed-off-by: Johannes Weiner
---
 kernel/sched/sched.h | 164 +++++++++++++++++++++----------------------
 1 file changed, 82 insertions(+), 82 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..eb9b1326906c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -926,6 +926,8 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+extern void update_rq_clock(struct rq *rq);
+
 static inline u64 __rq_clock_broken(struct rq *rq)
 {
 	return READ_ONCE(rq->clock);
@@ -1044,6 +1046,86 @@ static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 #endif
 }
 
+struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
+	__acquires(rq->lock);
+
+struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
+	__acquires(p->pi_lock)
+	__acquires(rq->lock);
+
+static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
+	__releases(rq->lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock(&rq->lock);
+}
+
+static inline void
+task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+	__releases(rq->lock)
+	__releases(p->pi_lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
+}
+
+static inline void
+rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_lock_irq(struct rq *rq, struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	raw_spin_lock_irq(&rq->lock);
+	rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_lock(struct rq *rq, struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	raw_spin_lock(&rq->lock);
+	rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_relock(struct rq *rq, struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	raw_spin_lock(&rq->lock);
+	rq_repin_lock(rq, rf);
+}
+
+static inline void
+rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
+	__releases(rq->lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+}
+
+static inline void
+rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
+	__releases(rq->lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock_irq(&rq->lock);
+}
+
+static inline void
+rq_unlock(struct rq *rq, struct rq_flags *rf)
+	__releases(rq->lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock(&rq->lock);
+}
+
 #ifdef CONFIG_NUMA
 enum numa_topology_type {
 	NUMA_DIRECT,
@@ -1683,8 +1765,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
 	sched_update_tick_dependency(rq);
 }
 
-extern void update_rq_clock(struct rq *rq);
-
 extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
 extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
 
@@ -1765,86 +1845,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
 static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
-struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
-	__acquires(rq->lock);
-
-struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
-	__acquires(p->pi_lock)
-	__acquires(rq->lock);
-
-static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
-	__releases(rq->lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
-}
-
-static inline void
-task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
-	__releases(rq->lock)
-	__releases(p->pi_lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
-	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
-}
-
-static inline void
-rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
-	__acquires(rq->lock)
-{
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
-	rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_lock_irq(struct rq *rq, struct rq_flags *rf)
-	__acquires(rq->lock)
-{
-	raw_spin_lock_irq(&rq->lock);
-	rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_lock(struct rq *rq, struct rq_flags *rf)
-	__acquires(rq->lock)
-{
-	raw_spin_lock(&rq->lock);
-	rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_relock(struct rq *rq, struct rq_flags *rf)
-	__acquires(rq->lock)
-{
-	raw_spin_lock(&rq->lock);
-	rq_repin_lock(rq, rf);
-}
-
-static inline void
-rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
-	__releases(rq->lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
-}
-
-static inline void
-rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
-	__releases(rq->lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
-}
-
-static inline void
-rq_unlock(struct rq *rq, struct rq_flags *rf)
-	__releases(rq->lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
-}
-
 #ifdef CONFIG_SMP
 
 #ifdef CONFIG_PREEMPT
From patchwork Tue Aug 28 17:22:56 2018
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578897
From: Johannes Weiner
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
    Christopher Lameter, Peter Enderborg, Shakeel Butt, Mike Galbraith,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 7/9] sched: introduce this_rq_lock_irq()
Date: Tue, 28 Aug 2018 13:22:56 -0400
Message-Id: <20180828172258.3185-8-hannes@cmpxchg.org>
In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org>
References: <20180828172258.3185-1-hannes@cmpxchg.org>

do_sched_yield() disables IRQs, looks up this_rq() and locks it. The
next patch is adding another site with the same pattern, so provide a
convenience function for it.
Signed-off-by: Johannes Weiner
---
 kernel/sched/core.c  |  4 +---
 kernel/sched/sched.h | 12 ++++++++++++
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..61059e671fc6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4960,9 +4960,7 @@ static void do_sched_yield(void)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	local_irq_disable();
-	rq = this_rq();
-	rq_lock(rq, &rf);
+	rq = this_rq_lock_irq(&rf);
 
 	schedstat_inc(rq->yld_count);
 	current->sched_class->yield_task(rq);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eb9b1326906c..83db5de1464c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1126,6 +1126,18 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
 	raw_spin_unlock(&rq->lock);
 }
 
+static inline struct rq *
+this_rq_lock_irq(struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	struct rq *rq;
+
+	local_irq_disable();
+	rq = this_rq();
+	rq_lock(rq, rf);
+	return rq;
+}
+
 #ifdef CONFIG_NUMA
 enum numa_topology_type {
 	NUMA_DIRECT,
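[ Editorial sketch, not part of the patch: what a new call site would
  look like, mirroring the do_sched_yield() conversion above. The
  function name example_touch_this_rq() is hypothetical. ]

static void example_touch_this_rq(void)
{
	struct rq_flags rf;
	struct rq *rq;

	/* IRQs off, this CPU's rq->lock taken and pinned, in one call */
	rq = this_rq_lock_irq(&rf);
	update_rq_clock(rq);
	rq_unlock_irq(rq, &rf);		/* unpin, unlock, IRQs back on */
}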
From patchwork Tue Aug 28 17:22:57 2018
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578901
From: Johannes Weiner
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
    Christopher Lameter, Peter Enderborg, Shakeel Butt, Mike Galbraith,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
Date: Tue, 28 Aug 2018 13:22:57 -0400
Message-Id: <20180828172258.3185-9-hannes@cmpxchg.org>
In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org>
References: <20180828172258.3185-1-hannes@cmpxchg.org>

When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing
individual job health or risk complete machine lockups, this patch
implements a way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively.
Stall states are aggregate versions of the per-task delay accounting
delays:

       cpu: some tasks are runnable but not executing on a CPU
    memory: tasks are reclaiming, or waiting for swapin or thrashing cache
        io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).

v2:
- stable clock tick, as per Peter
- data structure layout optimization, as per Peter
- fix u64 divisions on 32 bit, as per Peter
- outermost psi_disabled checks, as per Peter
- coding style fixes, as per Peter
- just-in-time stats aggregation, as per Suren
- fix task state corruption with CONFIG_PREEMPT, as per Suren
- CONFIG_PSI=n build error
- avoid writing p->sched_psi_wake_requeue unnecessarily
- documentation & comment updates

v3:
- pack scheduler hotpath data into one cacheline, as per Peter and Linus
- drop unnecessary SCHED_INFO dependency, as per Peter
- lockless live-state aggregation, as per Peter
- do_div -> div64_ul and some other cleanups, as per Peter
- realtime sampling period and slipped sample handling, as per Tejun

v4:
- replace an unsafe cpu_curr() dereference in the aggregator by
  sampling active reclaimers from scheduler_tick(), as per Peter
- fix several race conditions that cause the unlocked live aggregator
  to get ahead of the scheduler's recorded times and cause sample
  calculations to underflow into bogusly large time deltas, as per Suren
- fix rare accounting artifacts from CPU hotplugging, as per Peter
- make the aggregation loop over all states more readable, as per Peter

Signed-off-by: Johannes Weiner
---
 Documentation/accounting/psi.txt |  64 +++
 include/linux/psi.h              |  28 ++
 include/linux/psi_types.h        |  92 +++++
 include/linux/sched.h            |  10 +
 init/Kconfig                     |  15 +
 kernel/fork.c                    |   4 +
 kernel/sched/Makefile            |   1 +
 kernel/sched/core.c              |  12 +-
 kernel/sched/psi.c               | 650 +++++++++++++++++++++++++++++++
 kernel/sched/sched.h             |   2 +
 kernel/sched/stats.h             |  86 ++++
 mm/compaction.c                  |   5 +
 mm/filemap.c                     |  15 +-
 mm/page_alloc.c                  |   9 +
 mm/vmscan.c                      |   9 +
 15 files changed, 996 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/accounting/psi.txt
 create mode 100644 include/linux/psi.h
 create mode 100644 include/linux/psi_types.h
 create mode 100644 kernel/sched/psi.c

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index 000000000000..51e7ef14142e
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,64 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios are tracked as recent trends over ten, sixty, and three
+hundred second windows, which gives insight into short term events as
+well as medium and long term trends. The total absolute stall time is
+tracked and exported as well, to allow detection of latency spikes
+which wouldn't necessarily make a dent in the time averages, or to
+average trends over custom time frames.
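[ Editorial sketch, not part of the patch: a minimal userspace reader
  for the interface documented above. Assumes a CONFIG_PSI=y kernel;
  error handling kept to a minimum. ]

#include <stdio.h>

int main(void)
{
	char type[8];
	double avg10, avg60, avg300;
	unsigned long long total;
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f) {
		perror("fopen");	/* no psi support in this kernel? */
		return 1;
	}
	/* one line for "some", a second one for "full" */
	while (fscanf(f, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu\n",
		      type, &avg10, &avg60, &avg300, &total) == 5)
		printf("%s: %.2f%% (10s) %.2f%% (60s) %.2f%% (300s) total=%llu\n",
		       type, avg10, avg60, avg300, total);
	fclose(f);
	return 0;
}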
diff --git a/include/linux/psi.h b/include/linux/psi.h
new file mode 100644
index 000000000000..b0daf050de58
--- /dev/null
+++ b/include/linux/psi.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_PSI_H
+#define _LINUX_PSI_H
+
+#include
+#include
+
+#ifdef CONFIG_PSI
+
+extern bool psi_disabled;
+
+void psi_init(void);
+
+void psi_task_change(struct task_struct *task, int clear, int set);
+
+void psi_memstall_tick(struct task_struct *task, int cpu);
+void psi_memstall_enter(unsigned long *flags);
+void psi_memstall_leave(unsigned long *flags);
+
+#else /* CONFIG_PSI */
+
+static inline void psi_init(void) {}
+
+static inline void psi_memstall_enter(unsigned long *flags) {}
+static inline void psi_memstall_leave(unsigned long *flags) {}
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_H */

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
new file mode 100644
index 000000000000..2cf422db5d18
--- /dev/null
+++ b/include/linux/psi_types.h
@@ -0,0 +1,92 @@
+#ifndef _LINUX_PSI_TYPES_H
+#define _LINUX_PSI_TYPES_H
+
+#include
+#include
+
+#ifdef CONFIG_PSI
+
+/* Tracked task states */
+enum psi_task_count {
+	NR_IOWAIT,
+	NR_MEMSTALL,
+	NR_RUNNING,
+	NR_PSI_TASK_COUNTS,
+};
+
+/* Task state bitmasks */
+#define TSK_IOWAIT	(1 << NR_IOWAIT)
+#define TSK_MEMSTALL	(1 << NR_MEMSTALL)
+#define TSK_RUNNING	(1 << NR_RUNNING)
+
+/* Resources that workloads could be stalled on */
+enum psi_res {
+	PSI_IO,
+	PSI_MEM,
+	PSI_CPU,
+	NR_PSI_RESOURCES,
+};
+
+/*
+ * Pressure states for each resource:
+ *
+ * SOME: Stalled tasks & working tasks
+ * FULL: Stalled tasks & no working tasks
+ */
+enum psi_states {
+	PSI_IO_SOME,
+	PSI_IO_FULL,
+	PSI_MEM_SOME,
+	PSI_MEM_FULL,
+	PSI_CPU_SOME,
+	/* Only per-CPU, to weigh the CPU in the global average: */
+	PSI_NONIDLE,
+	NR_PSI_STATES,
+};
+
+struct psi_group_cpu {
+	/* 1st cacheline updated by the scheduler */
+
+	/* Aggregator needs to know of concurrent changes */
+	seqcount_t seq ____cacheline_aligned_in_smp;
+
+	/* States of the tasks belonging to this group */
+	unsigned int tasks[NR_PSI_TASK_COUNTS];
+
+	/* Period time sampling buckets for each state of interest (ns) */
+	u32 times[NR_PSI_STATES];
+
+	/* Time of last task change in this group (rq_clock) */
+	u64 state_start;
+
+	/* 2nd cacheline updated by the aggregator */
+
+	/* Delta detection against the sampling buckets */
+	u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
+};
+
+struct psi_group {
+	/* Protects data updated during an aggregation */
+	struct mutex stat_lock;
+
+	/* Per-cpu task state & time tracking */
+	struct psi_group_cpu __percpu *pcpu;
+
+	/* Periodic aggregation state */
+	u64 total_prev[NR_PSI_STATES - 1];
+	u64 last_update;
+	u64 next_update;
+	struct delayed_work clock_work;
+
+	/* Total stall times and sampled pressure averages */
+	u64 total[NR_PSI_STATES - 1];
+	unsigned long avg[NR_PSI_STATES - 1][3];
+};
+
+#else /* CONFIG_PSI */
+
+struct psi_group { };
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_TYPES_H */

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43731fe51c97..87c2fe4a28b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -710,6 +711,10 @@ struct task_struct {
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
 	unsigned			sched_remote_wakeup:1;
+#ifdef CONFIG_PSI
+	unsigned			sched_psi_wake_requeue:1;
+#endif
+
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
@@ -957,6 +962,10 @@ struct task_struct {
 	siginfo_t			*last_siginfo;
 	struct task_io_accounting	ioac;
+#ifdef CONFIG_PSI
+	/* Pressure stall state */
+	unsigned int			psi_flags;
+#endif
 #ifdef CONFIG_TASK_XACCT
 	/* Accumulated RSS usage: */
 	u64				acct_rss_mem1;
@@ -1397,6 +1406,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000	/* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */

diff --git a/init/Kconfig b/init/Kconfig
index 041f3a022122..98d59bc268df 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -455,6 +455,21 @@ config TASK_IO_ACCOUNTING
 
 	  Say N if unsure.
 
+config PSI
+	bool "Pressure stall information tracking"
+	help
+	  Collect metrics that indicate how overcommitted the CPU, memory,
+	  and IO capacity are in the system.
+
+	  If you say Y here, the kernel will create /proc/pressure/ with the
+	  pressure statistics files cpu, memory, and io. These will indicate
+	  the share of walltime in which some or all tasks in the system are
+	  delayed due to contention of the respective resource.
+
+	  For more details see Documentation/accounting/psi.txt.
+
+	  Say N if unsure.
+
 endmenu # "CPU/Task time and stats accounting"
 
 config CPU_ISOLATION

diff --git a/kernel/fork.c b/kernel/fork.c
index 1b27babc4c78..f6cd2dd13db8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1736,6 +1736,10 @@ static __latent_entropy struct task_struct *copy_process(
 
 	p->default_timer_slack_ns = current->timer_slack_ns;
 
+#ifdef CONFIG_PSI
+	p->psi_flags = 0;
+#endif
+
 	task_io_accounting_init(&p->ioac);
 	acct_clear_integrals(p);

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..b29bc18f2704 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+obj-$(CONFIG_PSI) += psi.o

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61059e671fc6..0fa008c43400 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -744,8 +744,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
-	if (!(flags & ENQUEUE_RESTORE))
+	if (!(flags & ENQUEUE_RESTORE)) {
 		sched_info_queued(rq, p);
+		psi_enqueue(p, flags & ENQUEUE_WAKEUP);
+	}
 
 	p->sched_class->enqueue_task(rq, p, flags);
 }
@@ -755,8 +757,10 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
-	if (!(flags & DEQUEUE_SAVE))
+	if (!(flags & DEQUEUE_SAVE)) {
 		sched_info_dequeued(rq, p);
+		psi_dequeue(p, flags & DEQUEUE_SLEEP);
+	}
 
 	p->sched_class->dequeue_task(rq, p, flags);
 }
@@ -2060,6 +2064,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
+		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
 
@@ -3078,6 +3083,7 @@ void scheduler_tick(void)
 	curr->sched_class->task_tick(rq, curr, 0);
 	cpu_load_update_active(rq);
 	calc_global_load_tick(rq);
+	psi_task_tick(rq);
 
 	rq_unlock(rq, &rf);
@@ -6110,6 +6116,8 @@ void __init sched_init(void)
 
 	init_schedstats();
 
+	psi_init();
+
 	scheduler_running = 1;
 }

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
new file mode 100644
index 000000000000..92489e66840b
--- /dev/null
+++ b/kernel/sched/psi.c
@@ -0,0 +1,650 @@
+/*
+ * Pressure stall information for CPU, memory and IO
+ *
+ * Copyright (c) 2018 Facebook, Inc.
+ * Author: Johannes Weiner
+ *
+ * When CPU, memory and IO are contended, tasks experience delays that
+ * reduce throughput and introduce latencies into the workload. Memory
+ * and IO contention, in addition, can cause a full loss of forward
+ * progress in which the CPU goes idle.
+ *
+ * This code aggregates individual task delays into resource pressure
+ * metrics that indicate problems with both workload health and
+ * resource utilization.
+ *
+ * Model
+ *
+ * The time in which a task can execute on a CPU is our baseline for
+ * productivity. Pressure expresses the amount of time in which this
+ * potential cannot be realized due to resource contention.
+ *
+ * This concept of productivity has two components: the workload and
+ * the CPU. To measure the impact of pressure on both, we define two
+ * contention states for a resource: SOME and FULL.
+ *
+ * In the SOME state of a given resource, one or more tasks are
+ * delayed on that resource. This affects the workload's ability to
+ * perform work, but the CPU may still be executing other tasks.
+ *
+ * In the FULL state of a given resource, all non-idle tasks are
+ * delayed on that resource such that nobody is advancing and the CPU
+ * goes idle. This leaves both workload and CPU unproductive.
+ *
+ * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ *
+ *	SOME = nr_delayed_tasks != 0
+ *	FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
+ *
+ * The percentage of wallclock time spent in those compound stall
+ * states gives pressure numbers between 0 and 100 for each resource,
+ * where the SOME percentage indicates workload slowdowns and the FULL
+ * percentage indicates reduced CPU utilization:
+ *
+ *	%SOME = time(SOME) / period
+ *	%FULL = time(FULL) / period
+ *
+ * Multiple CPUs
+ *
+ * The more tasks and available CPUs there are, the more work can be
+ * performed concurrently. This means that the potential that can go
+ * unrealized due to resource contention *also* scales with non-idle
+ * tasks and CPUs.
+ *
+ * Consider a scenario where 257 number crunching tasks are trying to
+ * run concurrently on 256 CPUs. If we simply aggregated the task
+ * states, we would have to conclude a CPU SOME pressure number of
+ * 100%, since *somebody* is waiting on a runqueue at all
+ * times. However, that is clearly not the amount of contention the
+ * workload is experiencing: only one out of 256 possible execution
+ * threads will be contended at any given time, or about 0.4%.
+ *
+ * Conversely, consider a scenario of 4 tasks and 4 CPUs where at any
+ * given time *one* of the tasks is delayed due to a lack of memory.
+ * Again, looking purely at the task state would yield a memory FULL
+ * pressure number of 0%, since *somebody* is always making forward
+ * progress. But again this wouldn't capture the amount of execution
+ * potential lost, which is 1 out of 4 CPUs, or 25%.
+ *
+ * To calculate wasted potential (pressure) with multiple processors,
+ * we have to base our calculation on the number of non-idle tasks in
+ * conjunction with the number of available CPUs, which is the number
+ * of potential execution threads. SOME then becomes the proportion of
+ * delayed tasks to possible threads, and FULL is the share of possible
+ * threads that are unproductive due to delays:
+ *
+ *	threads = min(nr_nonidle_tasks, nr_cpus)
+ *	   SOME = min(nr_delayed_tasks / threads, 1)
+ *	   FULL = (threads - min(nr_running_tasks, threads)) / threads
+ *
+ * For the 257 number crunchers on 256 CPUs, this yields:
+ *
+ *	threads = min(257, 256)
+ *	   SOME = min(1 / 256, 1)             =  0.4%
+ *	   FULL = (256 - min(257, 256)) / 256 =  0%
+ *
+ * For the 1 out of 4 memory-delayed tasks, this yields:
+ *
+ *	threads = min(4, 4)
+ *	   SOME = min(1 / 4, 1)               = 25%
+ *	   FULL = (4 - min(3, 4)) / 4         = 25%
+ *
+ * [ Substitute nr_cpus with 1, and you can see that it's a natural
+ *   extension of the single-CPU model. ]
+ *
+ * Implementation
+ *
+ * To assess the precise time spent in each such state, we would have
+ * to freeze the system on task changes and start/stop the state
+ * clocks accordingly. Obviously that doesn't scale in practice.
+ *
+ * Because the scheduler aims to distribute the compute load evenly
+ * among the available CPUs, we can track task state locally to each
+ * CPU and, at much lower frequency, extrapolate the global state for
+ * the cumulative stall times and the running averages.
+ *
+ * For each runqueue, we track:
+ *
+ *	   tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
+ *	   tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu])
+ *	tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
+ *
+ * and then periodically aggregate:
+ *
+ *	tNONIDLE = sum(tNONIDLE[i])
+ *
+ *	   tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
+ *	   tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
+ *
+ *	   %SOME = tSOME / period
+ *	   %FULL = tFULL / period
+ *
+ * This gives us an approximation of pressure that is practical
+ * cost-wise, yet way more sensitive and accurate than periodic
+ * sampling of the aggregate task states would be.
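[ Editorial sketch, not part of the patch: the SOME/FULL model above,
  transcribed into plain C to reproduce the two worked examples. ]

#include <stdio.h>

static double min_d(double a, double b) { return a < b ? a : b; }

static void pressure(double nonidle, double delayed, double running,
		     double cpus)
{
	double threads = min_d(nonidle, cpus);
	double some = min_d(delayed / threads, 1.0);
	double full = (threads - min_d(running, threads)) / threads;

	printf("SOME=%5.1f%% FULL=%5.1f%%\n", some * 100, full * 100);
}

int main(void)
{
	/* 257 number crunchers on 256 CPUs: 1 delayed, 256 running */
	pressure(257, 1, 256, 256);	/* SOME=  0.4% FULL=  0.0% */

	/* 4 tasks on 4 CPUs, one stalled on memory: 3 running */
	pressure(4, 1, 3, 4);		/* SOME= 25.0% FULL= 25.0% */
	return 0;
}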
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "sched.h"
+
+static int psi_bug __read_mostly;
+
+bool psi_disabled __read_mostly;
+core_param(psi_disabled, psi_disabled, bool, 0644);
+
+/* Running averages - we need to be higher-res than loadavg */
+#define PSI_FREQ	(2*HZ+1)	/* 2 sec intervals */
+#define EXP_10s		1677		/* 1/exp(2s/10s) as fixed-point */
+#define EXP_60s		1981		/* 1/exp(2s/60s) */
+#define EXP_300s	2034		/* 1/exp(2s/300s) */
+
+/* Sampling frequency in nanoseconds */
+static u64 psi_period __read_mostly;
+
+/* System-level pressure and stall tracking */
+static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu);
+static struct psi_group psi_system = {
+	.pcpu = &system_group_pcpu,
+};
+
+static void psi_clock(struct work_struct *work);
+
+static void group_init(struct psi_group *group)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
+	group->next_update = sched_clock() + psi_period;
+	INIT_DELAYED_WORK(&group->clock_work, psi_clock);
+	mutex_init(&group->stat_lock);
+}
+
+void __init psi_init(void)
+{
+	if (psi_disabled)
+		return;
+
+	psi_period = jiffies_to_nsecs(PSI_FREQ);
+	group_init(&psi_system);
+}
+
+static bool test_state(unsigned int *tasks, enum psi_states state)
+{
+	switch (state) {
+	case PSI_IO_SOME:
+		return tasks[NR_IOWAIT];
+	case PSI_IO_FULL:
+		return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+	case PSI_MEM_SOME:
+		return tasks[NR_MEMSTALL];
+	case PSI_MEM_FULL:
+		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+	case PSI_CPU_SOME:
+		return tasks[NR_RUNNING] > 1;
+	case PSI_NONIDLE:
+		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
+			tasks[NR_RUNNING];
+	default:
+		return false;
+	}
+}
+
+static u32 get_recent_time(struct psi_group *group, int cpu,
+			   enum psi_states state)
+{
+	struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+	unsigned int seq;
+	u32 time, delta;
+
+	do {
+		seq = read_seqcount_begin(&groupc->seq);
+
+		time = groupc->times[state];
+		/*
+		 * In addition to already concluded states, we also
+		 * incorporate currently active states on the CPU,
+		 * since states may last for many sampling periods.
+		 *
+		 * This way we keep our delta sampling buckets small
+		 * (u32) and our reported pressure close to what's
+		 * actually happening.
+
+static void calc_avgs(unsigned long avg[3], int missed_periods,
+		      u64 time, u64 period)
+{
+	unsigned long pct;
+
+	/* Fill in zeroes for periods of no activity */
+	if (missed_periods) {
+		avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods);
+		avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods);
+		avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods);
+	}
+
+	/* Sample the most recent active period */
+	pct = div_u64(time * 100, period);
+	pct *= FIXED_1;
+	avg[0] = calc_load(avg[0], EXP_10s, pct);
+	avg[1] = calc_load(avg[1], EXP_60s, pct);
+	avg[2] = calc_load(avg[2], EXP_300s, pct);
+}
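
[ Aside, not part of the patch: calc_avgs() above reuses the loadavg
  fixed-point scheme, where FIXED_1 is 2048 and e.g. EXP_10s is
  2048 / e^(2s/10s) ~= 1677. A minimal standalone sketch of that
  decay step, with illustrative names and simplified rounding: ]

#include <stdio.h>

#define FP_ONE		2048UL	/* FIXED_1 */
#define FP_EXP_10s	1677UL	/* FP_ONE / exp(2s/10s) */

/* avg' = avg * (exp/FP_ONE) + sample * (1 - exp/FP_ONE) */
static unsigned long decay_avg(unsigned long avg, unsigned long exp,
			       unsigned long sample)
{
	return (avg * exp + sample * (FP_ONE - exp)) / FP_ONE;
}

int main(void)
{
	/* Feed five periods that were 50% stalled into a zero average */
	unsigned long avg = 0, sample = 50 * FP_ONE;	/* pct * FIXED_1 */
	int i;

	for (i = 0; i < 5; i++) {
		avg = decay_avg(avg, FP_EXP_10s, sample);
		printf("avg10 ~= %lu.%02lu%%\n", avg / FP_ONE,
		       (avg % FP_ONE) * 100 / FP_ONE);
	}
	return 0;	/* converges toward 50.00% */
}
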
+
+static bool update_stats(struct psi_group *group)
+{
+	u64 deltas[NR_PSI_STATES - 1] = { 0, };
+	unsigned long missed_periods = 0;
+	unsigned long nonidle_total = 0;
+	u64 now, expires, period;
+	int cpu;
+	int s;
+
+	mutex_lock(&group->stat_lock);
+
+	/*
+	 * Collect the per-cpu time buckets and average them into a
+	 * single time sample that is normalized to wallclock time.
+	 *
+	 * For averaging, each CPU is weighted by its non-idle time in
+	 * the sampling period. This eliminates artifacts from uneven
+	 * loading, or even entirely idle CPUs.
+	 */
+	for_each_possible_cpu(cpu) {
+		u32 nonidle;
+
+		nonidle = get_recent_time(group, cpu, PSI_NONIDLE);
+		nonidle = nsecs_to_jiffies(nonidle);
+		nonidle_total += nonidle;
+
+		for (s = 0; s < PSI_NONIDLE; s++) {
+			u32 delta;
+
+			delta = get_recent_time(group, cpu, s);
+			deltas[s] += (u64)delta * nonidle;
+		}
+	}
+
+	/*
+	 * Integrate the sample into the running statistics that are
+	 * reported to userspace: the cumulative stall times and the
+	 * decaying averages.
+	 *
+	 * Pressure percentages are sampled at PSI_FREQ. We might be
+	 * called more often when the user polls more frequently than
+	 * that; we might be called less often when there is no task
+	 * activity, thus no data, and clock ticks are sporadic. The
+	 * below handles both.
+	 */
+
+	/* total= */
+	for (s = 0; s < NR_PSI_STATES - 1; s++)
+		group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
+
+	/* avgX= */
+	now = sched_clock();
+	expires = group->next_update;
+	if (now < expires)
+		goto out;
+	if (now - expires > psi_period)
+		missed_periods = div_u64(now - expires, psi_period);
+
+	/*
+	 * The periodic clock tick can get delayed for various
+	 * reasons, especially on loaded systems. To avoid clock
+	 * drift, we schedule the clock in fixed psi_period intervals.
+	 * But the deltas we sample out of the per-cpu buckets above
+	 * are based on the actual time elapsing between clock ticks.
+	 */
+	group->next_update = expires + ((1 + missed_periods) * psi_period);
+	period = now - (group->last_update + (missed_periods * psi_period));
+	group->last_update = now;
+
+	for (s = 0; s < NR_PSI_STATES - 1; s++) {
+		u32 sample;
+
+		sample = group->total[s] - group->total_prev[s];
+		/*
+		 * Due to the lockless sampling of the time buckets,
+		 * recorded time deltas can slip into the next period,
+		 * which under full pressure can result in samples in
+		 * excess of the period length.
+		 *
+		 * We don't want to report nonsensical pressures in
+		 * excess of 100%, nor do we want to drop such events
+		 * on the floor. Instead we punt any overage into the
+		 * future until pressure subsides. By doing this we
+		 * don't underreport the occurring pressure curve, we
+		 * just report it delayed by one period length.
+		 *
+		 * The error isn't cumulative. As soon as another
+		 * delta slips from a period P to P+1, by definition
+		 * it frees up its time T in P.
+		 */
+		if (sample > period)
+			sample = period;
+		group->total_prev[s] += sample;
+		calc_avgs(group->avg[s], missed_periods, sample, period);
+	}
+out:
+	mutex_unlock(&group->stat_lock);
+	return nonidle_total;
+}
+
+static void psi_clock(struct work_struct *work)
+{
+	struct delayed_work *dwork;
+	struct psi_group *group;
+	bool nonidle;
+
+	dwork = to_delayed_work(work);
+	group = container_of(dwork, struct psi_group, clock_work);
+
+	/*
+	 * If there is task activity, periodically fold the per-cpu
+	 * times and feed samples into the running averages. If things
+	 * are idle and there is no data to process, stop the clock.
+	 * Once restarted, we'll catch up the running averages in one
+	 * go - see calc_avgs() and missed_periods.
+	 */
+
+	nonidle = update_stats(group);
+
+	if (nonidle) {
+		unsigned long delay = 0;
+		u64 now;
+
+		now = sched_clock();
+		if (group->next_update > now)
+			delay = nsecs_to_jiffies(group->next_update - now) + 1;
+		schedule_delayed_work(dwork, delay);
+	}
+}
+
+static void record_times(struct psi_group_cpu *groupc, int cpu,
+			 bool memstall_tick)
+{
+	u32 delta;
+	u64 now;
+
+	now = cpu_clock(cpu);
+	delta = now - groupc->state_start;
+	groupc->state_start = now;
+
+	if (test_state(groupc->tasks, PSI_IO_SOME)) {
+		groupc->times[PSI_IO_SOME] += delta;
+		if (test_state(groupc->tasks, PSI_IO_FULL))
+			groupc->times[PSI_IO_FULL] += delta;
+	}
+
+	if (test_state(groupc->tasks, PSI_MEM_SOME)) {
+		groupc->times[PSI_MEM_SOME] += delta;
+		if (test_state(groupc->tasks, PSI_MEM_FULL))
+			groupc->times[PSI_MEM_FULL] += delta;
+		else if (memstall_tick) {
+			u32 sample;
+			/*
+			 * Since we care about lost potential, a
+			 * memstall is FULL when there are no other
+			 * working tasks, but also when the CPU is
+			 * actively reclaiming and nothing productive
+			 * could run even if it were runnable.
+			 *
+			 * When the timer tick sees a reclaiming CPU,
+			 * regardless of runnable tasks, sample a FULL
+			 * tick (or less if it hasn't been a full tick
+			 * since the last state change).
+			 */
+			sample = min(delta, (u32)jiffies_to_nsecs(1));
+			groupc->times[PSI_MEM_FULL] += sample;
+		}
+	}
+
+	if (test_state(groupc->tasks, PSI_CPU_SOME))
+		groupc->times[PSI_CPU_SOME] += delta;
+
+	if (test_state(groupc->tasks, PSI_NONIDLE))
+		groupc->times[PSI_NONIDLE] += delta;
+}
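
[ Aside: a worked example of the FULL-tick clamp above. With HZ=1000,
  jiffies_to_nsecs(1) is 1,000,000 ns, so a CPU seen reclaiming at the
  tick is credited min(delta, 1000000) ns of FULL time: the full
  1,000,000 ns if its state last changed 2.5 ms ago, but only
  300,000 ns if it changed 0.3 ms ago. ]
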
+
+static void psi_group_change(struct psi_group *group, int cpu,
+			     unsigned int clear, unsigned int set)
+{
+	struct psi_group_cpu *groupc;
+	unsigned int t, m;
+
+	groupc = per_cpu_ptr(group->pcpu, cpu);
+
+	/*
+	 * First we assess the aggregate resource states this CPU's
+	 * tasks have been in since the last change, and account any
+	 * SOME and FULL time these may have resulted in.
+	 *
+	 * Then we update the task counts according to the state
+	 * change requested through the @clear and @set bits.
+	 */
+	write_seqcount_begin(&groupc->seq);
+
+	record_times(groupc, cpu, false);
+
+	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
+		if (!(m & (1 << t)))
+			continue;
+		if (groupc->tasks[t] == 0 && !psi_bug) {
+			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n",
+					cpu, t, groupc->tasks[0],
+					groupc->tasks[1], groupc->tasks[2],
+					clear, set);
+			psi_bug = 1;
+		}
+		groupc->tasks[t]--;
+	}
+
+	for (t = 0; set; set &= ~(1 << t), t++)
+		if (set & (1 << t))
+			groupc->tasks[t]++;
+
+	write_seqcount_end(&groupc->seq);
+
+	if (!delayed_work_pending(&group->clock_work))
+		schedule_delayed_work(&group->clock_work, PSI_FREQ);
+}
+
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
+	int cpu = task_cpu(task);
+
+	if (!task->pid)
+		return;
+
+	if (((task->psi_flags & set) ||
+	     (task->psi_flags & clear) != clear) &&
+	    !psi_bug) {
+		printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
+				task->pid, task->comm, cpu,
+				task->psi_flags, clear, set);
+		psi_bug = 1;
+	}
+
+	task->psi_flags &= ~clear;
+	task->psi_flags |= set;
+
+	psi_group_change(&psi_system, cpu, clear, set);
+}
+
+void psi_memstall_tick(struct task_struct *task, int cpu)
+{
+	struct psi_group_cpu *groupc;
+
+	groupc = per_cpu_ptr(psi_system.pcpu, cpu);
+	write_seqcount_begin(&groupc->seq);
+	record_times(groupc, cpu, true);
+	write_seqcount_end(&groupc->seq);
+}
+
+/**
+ * psi_memstall_enter - mark the beginning of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as being stalled due to a lack of memory,
+ * such as waiting for a refault or performing reclaim.
+ */
+void psi_memstall_enter(unsigned long *flags)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (psi_disabled)
+		return;
+
+	*flags = current->flags & PF_MEMSTALL;
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
+	 * changes to the task's scheduling state, otherwise we can
+	 * race with CPU migration.
+	 */
+	rq = this_rq_lock_irq(&rf);
+
+	current->flags |= PF_MEMSTALL;
+	psi_task_change(current, 0, TSK_MEMSTALL);
+
+	rq_unlock_irq(rq, &rf);
+}
+
+/**
+ * psi_memstall_leave - mark the end of a memory stall section
+ * @flags: flags to handle nested memstall sections
+ *
+ * Marks the calling task as no longer stalled due to lack of memory.
+ */
+void psi_memstall_leave(unsigned long *flags)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (psi_disabled)
+		return;
+
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMSTALL clearing & accounting needs to be atomic wrt
+	 * changes to the task's scheduling state, otherwise we could
+	 * race with CPU migration.
+	 */
+	rq = this_rq_lock_irq(&rf);
+
+	current->flags &= ~PF_MEMSTALL;
+	psi_task_change(current, TSK_MEMSTALL, 0);
+
+	rq_unlock_irq(rq, &rf);
+}
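
[ Aside, not part of the patch: a usage sketch for the annotation API
  above. do_some_reclaim_work() is a hypothetical stand-in for a real
  reclaim or refault path like the ones instrumented in the mm/ hunks
  further down. ]

/* Hypothetical kernel path that stalls on memory (illustration) */
static void example_reclaim_path(void)
{
	unsigned long pflags;

	psi_memstall_enter(&pflags);	/* sets PF_MEMSTALL unless nested */

	do_some_reclaim_work();		/* the actual memory work */

	psi_memstall_leave(&pflags);	/* clears it only at the outermost level */
}

[ Because *flags snapshots PF_MEMSTALL on entry, nested enter/leave
  pairs are harmless: only the outermost pair toggles the task state. ]
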
"full" : "some", + LOAD_INT(avg[0]), LOAD_FRAC(avg[0]), + LOAD_INT(avg[1]), LOAD_FRAC(avg[1]), + LOAD_INT(avg[2]), LOAD_FRAC(avg[2]), + total); + } + + return 0; +} + +static int psi_io_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_IO); +} + +static int psi_memory_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_MEM); +} + +static int psi_cpu_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_CPU); +} + +static int psi_io_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_io_show, NULL); +} + +static int psi_memory_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_memory_show, NULL); +} + +static int psi_cpu_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_cpu_show, NULL); +} + +static const struct file_operations psi_io_fops = { + .open = psi_io_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static const struct file_operations psi_memory_fops = { + .open = psi_memory_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static const struct file_operations psi_cpu_fops = { + .open = psi_cpu_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int __init psi_proc_init(void) +{ + proc_mkdir("pressure", NULL); + proc_create("pressure/io", 0, NULL, &psi_io_fops); + proc_create("pressure/memory", 0, NULL, &psi_memory_fops); + proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops); + return 0; +} +module_init(psi_proc_init); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 83db5de1464c..25c5538647ad 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -54,6 +54,7 @@ #include #include #include +#include #include #include #include @@ -320,6 +321,7 @@ extern bool dl_cpu_busy(unsigned int cpu); #ifdef CONFIG_CGROUP_SCHED #include +#include struct cfs_rq; struct rt_rq; diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index 8aea199a39b4..2e07d8f59b3e 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -55,6 +55,92 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt # define schedstat_val_or_zero(var) 0 #endif /* CONFIG_SCHEDSTATS */ +#ifdef CONFIG_PSI +/* + * PSI tracks state that persists across sleeps, such as iowaits and + * memory stalls. As a result, it has to distinguish between sleeps, + * where a task's runnable state changes, and requeues, where a task + * and its state are being moved between CPUs and runqueues. + */ +static inline void psi_enqueue(struct task_struct *p, bool wakeup) +{ + int clear = 0, set = TSK_RUNNING; + + if (psi_disabled) + return; + + if (!wakeup || p->sched_psi_wake_requeue) { + if (p->flags & PF_MEMSTALL) + set |= TSK_MEMSTALL; + if (p->sched_psi_wake_requeue) + p->sched_psi_wake_requeue = 0; + } else { + if (p->in_iowait) + clear |= TSK_IOWAIT; + } + + psi_task_change(p, clear, set); +} + +static inline void psi_dequeue(struct task_struct *p, bool sleep) +{ + int clear = TSK_RUNNING, set = 0; + + if (psi_disabled) + return; + + if (!sleep) { + if (p->flags & PF_MEMSTALL) + clear |= TSK_MEMSTALL; + } else { + if (p->in_iowait) + set |= TSK_IOWAIT; + } + + psi_task_change(p, clear, set); +} + +static inline void psi_ttwu_dequeue(struct task_struct *p) +{ + if (psi_disabled) + return; + /* + * Is the task being migrated during a wakeup? 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 83db5de1464c..25c5538647ad 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,6 +54,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -320,6 +321,7 @@ extern bool dl_cpu_busy(unsigned int cpu);
 #ifdef CONFIG_CGROUP_SCHED
 #include
+#include
 
 struct cfs_rq;
 struct rt_rq;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8aea199a39b4..2e07d8f59b3e 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -55,6 +55,92 @@ static inline void rq_sched_info_depart(struct rq *rq, unsigned long long delta)
 # define schedstat_val_or_zero(var)	0
 #endif /* CONFIG_SCHEDSTATS */
 
+#ifdef CONFIG_PSI
+/*
+ * PSI tracks state that persists across sleeps, such as iowaits and
+ * memory stalls. As a result, it has to distinguish between sleeps,
+ * where a task's runnable state changes, and requeues, where a task
+ * and its state are being moved between CPUs and runqueues.
+ */
+static inline void psi_enqueue(struct task_struct *p, bool wakeup)
+{
+	int clear = 0, set = TSK_RUNNING;
+
+	if (psi_disabled)
+		return;
+
+	if (!wakeup || p->sched_psi_wake_requeue) {
+		if (p->flags & PF_MEMSTALL)
+			set |= TSK_MEMSTALL;
+		if (p->sched_psi_wake_requeue)
+			p->sched_psi_wake_requeue = 0;
+	} else {
+		if (p->in_iowait)
+			clear |= TSK_IOWAIT;
+	}
+
+	psi_task_change(p, clear, set);
+}
+
+static inline void psi_dequeue(struct task_struct *p, bool sleep)
+{
+	int clear = TSK_RUNNING, set = 0;
+
+	if (psi_disabled)
+		return;
+
+	if (!sleep) {
+		if (p->flags & PF_MEMSTALL)
+			clear |= TSK_MEMSTALL;
+	} else {
+		if (p->in_iowait)
+			set |= TSK_IOWAIT;
+	}
+
+	psi_task_change(p, clear, set);
+}
+
+static inline void psi_ttwu_dequeue(struct task_struct *p)
+{
+	if (psi_disabled)
+		return;
+	/*
+	 * Is the task being migrated during a wakeup? Make sure to
+	 * deregister its sleep-persistent psi states from the old
+	 * queue, and let psi_enqueue() know it has to requeue.
+	 */
+	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
+		struct rq_flags rf;
+		struct rq *rq;
+		int clear = 0;
+
+		if (p->in_iowait)
+			clear |= TSK_IOWAIT;
+		if (p->flags & PF_MEMSTALL)
+			clear |= TSK_MEMSTALL;
+
+		rq = __task_rq_lock(p, &rf);
+		psi_task_change(p, clear, 0);
+		p->sched_psi_wake_requeue = 1;
+		__task_rq_unlock(rq, &rf);
+	}
+}
+
+static inline void psi_task_tick(struct rq *rq)
+{
+	if (psi_disabled)
+		return;
+
+	if (unlikely(rq->curr->flags & PF_MEMSTALL))
+		psi_memstall_tick(rq->curr, rq->cpu);
+}
+#else /* CONFIG_PSI */
+static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
+static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
+static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline void psi_task_tick(struct rq *rq) {}
+#endif /* CONFIG_PSI */
+
 #ifdef CONFIG_SCHED_INFO
 static inline void sched_info_reset_dequeued(struct task_struct *t)
 {
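
[ Aside, not part of the patch: the flag transitions the hooks above
  produce for a single task, traced out for exposition: ]

/*
 * Task blocks on IO and wakes up again:
 *
 *   running:       psi_flags == TSK_RUNNING
 *   blocks on IO:  psi_dequeue(p, true)  -> clear TSK_RUNNING, set TSK_IOWAIT
 *   wakes up:      psi_enqueue(p, true)  -> clear TSK_IOWAIT,  set TSK_RUNNING
 *
 * A plain requeue (sleep/wakeup == false), e.g. for load balancing,
 * instead moves the whole state - including TSK_MEMSTALL for a task
 * with PF_MEMSTALL set - off the old runqueue and onto the new one.
 */
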
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..7c607479de4a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include <linux/psi.h>
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
@@ -2068,11 +2069,15 @@ static int kcompactd(void *p)
 	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
+		unsigned long pflags;
+
 		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
+		psi_memstall_enter(&pflags);
 		kcompactd_do_work(pgdat);
+		psi_memstall_leave(&pflags);
 	}
 
 	return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index ca895ebe43ac..5d27f7f51aa4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include <linux/psi.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1075,11 +1076,14 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
 	bool thrashing = false;
+	unsigned long pflags;
 	int ret = 0;
 
-	if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+	if (bit_nr == PG_locked && !PageUptodate(page) &&
 	    PageWorkingset(page)) {
-		delayacct_thrashing_start();
+		if (!PageSwapBacked(page))
+			delayacct_thrashing_start();
+		psi_memstall_enter(&pflags);
 		thrashing = true;
 	}
 
@@ -1121,8 +1125,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	finish_wait(q, wait);
 
-	if (thrashing)
-		delayacct_thrashing_end();
+	if (thrashing) {
+		if (!PageSwapBacked(page))
+			delayacct_thrashing_end();
+		psi_memstall_leave(&pflags);
+	}
 
 	/*
 	 * A signal could leave PageWaiters set. Clearing it here if
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4be74e..2974b92273e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
 #include
 #include
 #include
+#include <linux/psi.h>
 #include
 #include
 
@@ -3549,15 +3550,20 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
+	unsigned long pflags;
 	unsigned int noreclaim_flag;
 
 	if (!order)
 		return NULL;
 
+	psi_memstall_enter(&pflags);
 	noreclaim_flag = memalloc_noreclaim_save();
+
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
+
 	memalloc_noreclaim_restore(noreclaim_flag);
+	psi_memstall_leave(&pflags);
 
 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
@@ -3756,11 +3762,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct reclaim_state reclaim_state;
 	int progress;
 	unsigned int noreclaim_flag;
+	unsigned long pflags;
 
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
+	psi_memstall_enter(&pflags);
 	fs_reclaim_acquire(gfp_mask);
 	noreclaim_flag = memalloc_noreclaim_save();
 	reclaim_state.reclaimed_slab = 0;
@@ -3772,6 +3780,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	current->reclaim_state = NULL;
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(gfp_mask);
+	psi_memstall_leave(&pflags);
 
 	cond_resched();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7fdbc18fea6f..818dd786a355 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include <linux/psi.h>
 #include
 #include
 
@@ -3131,6 +3132,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
+	unsigned long pflags;
 	int nid;
 	unsigned int noreclaim_flag;
 	struct scan_control sc = {
@@ -3159,9 +3161,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	psi_memstall_enter(&pflags);
 	noreclaim_flag = memalloc_noreclaim_save();
+
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
 	memalloc_noreclaim_restore(noreclaim_flag);
+	psi_memstall_leave(&pflags);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3326,6 +3332,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	int i;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	unsigned long pflags;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
@@ -3336,6 +3343,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		.may_swap = 1,
 	};
+	psi_memstall_enter(&pflags);
 	__fs_reclaim_acquire();
 
 	count_vm_event(PAGEOUTRUN);
@@ -3437,6 +3445,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 out:
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
+	psi_memstall_leave(&pflags);
 	/*
 	 * Return the order kswapd stopped reclaiming at as
 	 * prepare_kswapd_sleep() takes it into account. If another caller
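
[ Aside, not part of the series: the interface exported by the patch
  above can be consumed with ordinary reads. A minimal userspace reader
  for the format emitted by psi_show(), illustration only: ]

#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* Lines look like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
	while (fgets(line, sizeof(line), f)) {
		float avg10;

		if (sscanf(line, "some avg10=%f", &avg10) == 1)
			printf("memory SOME avg10: %.2f%%\n", avg10);
	}
	fclose(f);
	return 0;
}
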
From patchwork Tue Aug 28 17:22:58 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Johannes Weiner
X-Patchwork-Id: 10578899
From: Johannes Weiner
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon, Christopher Lameter, Peter Enderborg, Shakeel Butt, Mike Galbraith, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 9/9] psi: cgroup support
Date: Tue, 28 Aug 2018 13:22:58 -0400
Message-Id: <20180828172258.3185-10-hannes@cmpxchg.org>
In-Reply-To: <20180828172258.3185-1-hannes@cmpxchg.org>
References: <20180828172258.3185-1-hannes@cmpxchg.org>

On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system,
but also that of individual jobs, so that we can ensure individual job
health, fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure,
memory.pressure, and io.pressure files that track aggregate pressure
stall times for only the tasks inside the cgroup.

v3:
- fix copy-paste indentation screwups

v4:
- propagate psi_disabled checks outward
- factor out iterate_groups()

Acked-by: Tejun Heo
Signed-off-by: Johannes Weiner
---
 Documentation/accounting/psi.txt        |   9 ++
 Documentation/admin-guide/cgroup-v2.rst |  18 ++++
 include/linux/cgroup-defs.h             |   4 +
 include/linux/cgroup.h                  |  15 +++
 include/linux/psi.h                     |  25 +++++
 init/Kconfig                            |   4 +
 kernel/cgroup/cgroup.c                  |  45 ++++++++-
 kernel/sched/psi.c                      | 118 ++++++++++++++++++++++--
 8 files changed, 228 insertions(+), 10 deletions(-)

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index 51e7ef14142e..e051810d5127 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -62,3 +62,12 @@ well as medium and long term trends. The total absolute stall time is
 tracked and exported as well, to allow detection of latency spikes
 which wouldn't necessarily make a dent in the time averages, or to
 average trends over custom time frames.
+
+Cgroup2 interface
+=================
+
+In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
+mounted, pressure stall information is also tracked for tasks grouped
+into cgroups. Each subdirectory in the cgroupfs mountpoint contains
+cpu.pressure, memory.pressure, and io.pressure files; the format is
+the same as the /proc/pressure/ files.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d5c53b..02cb308ea400 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -963,6 +963,12 @@ All time durations are in microseconds.
 	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
 	one number is written, $MAX is updated.
 
+  cpu.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for CPU. See
+	Documentation/accounting/psi.txt for details.
+
 Memory
 ------
 
@@ -1250,6 +1256,12 @@ PAGE_SIZE multiple when read back.
 	higher than the limit for an extended period of time.  This
 	reduces the impact on the workload and memory management.
 
+  memory.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for memory. See
+	Documentation/accounting/psi.txt for details.
+
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
 
@@ -1385,6 +1397,12 @@ IO Interface Files
 	  8:16 rbps=2097152 wbps=max riops=max wiops=max
 
+  io.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for IO. See
+	Documentation/accounting/psi.txt for details.
+
 Writeback
 ~~~~~~~~~
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index c0e68f903011..f4be871ca169 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include <linux/psi_types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -435,6 +436,9 @@ struct cgroup {
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
 
+	/* used to track pressure stalls */
+	struct psi_group psi;
+
 	/* used to store eBPF programs */
 	struct cgroup_bpf bpf;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c9fdf6f57913..7b667a89704b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -627,6 +627,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 	pr_cont_kernfs_path(cgrp->kn);
 }
 
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return &cgrp->psi;
+}
+
 static inline void cgroup_init_kthreadd(void)
 {
 	/*
@@ -680,6 +685,16 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
 	return NULL;
 }
 
+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+	return NULL;
+}
+
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return NULL;
+}
+
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
 {
diff --git a/include/linux/psi.h b/include/linux/psi.h
index b0daf050de58..8e0725aac0aa 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -4,6 +4,9 @@
 #include
 #include
 
+struct seq_file;
+struct css_set;
+
 #ifdef CONFIG_PSI
 
 extern bool psi_disabled;
@@ -16,6 +19,14 @@ void psi_memstall_tick(struct task_struct *task, int cpu);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
+int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
+
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgrp);
+void psi_cgroup_free(struct cgroup *cgrp);
+void cgroup_move_task(struct task_struct *p, struct css_set *to);
+#endif
+
 #else /* CONFIG_PSI */
 
 static inline void psi_init(void) {}
@@ -23,6 +34,20 @@ static inline void psi_init(void) {}
 static inline void psi_memstall_enter(unsigned long *flags) {}
 static inline void psi_memstall_leave(unsigned long *flags) {}
 
+#ifdef CONFIG_CGROUPS
+static inline int psi_cgroup_alloc(struct cgroup *cgrp)
+{
+	return 0;
+}
+static inline void psi_cgroup_free(struct cgroup *cgrp)
+{
+}
+static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
+{
+	rcu_assign_pointer(p->cgroups, to);
+}
+#endif
+
 #endif /* CONFIG_PSI */
 
 #endif /* _LINUX_PSI_H */
diff --git a/init/Kconfig b/init/Kconfig
index 98d59bc268df..7506dcd81d1c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -466,6 +466,10 @@ config PSI
 	  the share of walltime in which some or all tasks in the system
 	  are delayed due to contention of the respective resource.
 
+	  In kernels with cgroup support, cgroups (cgroup2 only) will
+	  have cpu.pressure, memory.pressure, and io.pressure files,
+	  which aggregate pressure stalls for the grouped tasks only.
+
 	  For more details see Documentation/accounting/psi.txt.
 
 	  Say N if unsure.
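
[ Aside, not part of the patch: with the interface plumbing above in
  place, per-cgroup pressure reads exactly like the system-wide files.
  A minimal sketch, assuming a cgroup named "workload" under a cgroup2
  mount at /sys/fs/cgroup: ]

#include <stdio.h>

int main(void)
{
	char line[256];
	/* Path is an assumption for illustration */
	FILE *f = fopen("/sys/fs/cgroup/workload/memory.pressure", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* same format as /proc/pressure/ */
	fclose(f);
	return 0;
}
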
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 077370bf8964..ba7d3e1e3970 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -55,6 +55,7 @@
 #include
 #include
 #include
+#include <linux/psi.h>
 #include
 
 #define CREATE_TRACE_POINTS
@@ -829,7 +830,7 @@ static void css_set_move_task(struct task_struct *task,
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
 
-		rcu_assign_pointer(task->cgroups, to_cset);
+		cgroup_move_task(task, to_cset);
 		list_add_tail(&task->cg_list, use_mg_tasks ?
 			      &to_cset->mg_tasks : &to_cset->tasks);
 	}
@@ -3406,6 +3407,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
 	return ret;
 }
 
+#ifdef CONFIG_PSI
+static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO);
+}
+static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM);
+}
+static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
+}
+#endif
+
 static int cgroup_file_open(struct kernfs_open_file *of)
 {
 	struct cftype *cft = of->kn->priv;
@@ -4534,6 +4550,23 @@ static struct cftype cgroup_base_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = cpu_stat_show,
 	},
+#ifdef CONFIG_PSI
+	{
+		.name = "io.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_io_pressure_show,
+	},
+	{
+		.name = "memory.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_memory_pressure_show,
+	},
+	{
+		.name = "cpu.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_cpu_pressure_show,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -4594,6 +4627,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			 */
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
+			psi_cgroup_free(cgrp);
 			if (cgroup_on_dfl(cgrp))
 				cgroup_rstat_exit(cgrp);
 			kfree(cgrp);
@@ -4850,10 +4884,15 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
-	ret = cgroup_bpf_inherit(cgrp);
+
+	ret = psi_cgroup_alloc(cgrp);
 	if (ret)
 		goto out_idr_free;
 
+	ret = cgroup_bpf_inherit(cgrp);
+	if (ret)
+		goto out_psi_free;
+
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
 
@@ -4891,6 +4930,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 
 	return cgrp;
 
+out_psi_free:
+	psi_cgroup_free(cgrp);
 out_idr_free:
 	cgroup_idr_remove(&root->cgroup_idr, cgrp->id);
 out_stat_exit:
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 92489e66840b..84127de49193 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -466,9 +466,35 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		schedule_delayed_work(&group->clock_work, PSI_FREQ);
 }
 
+static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+{
+#ifdef CONFIG_CGROUPS
+	struct cgroup *cgroup = NULL;
+
+	if (!*iter)
+		cgroup = task->cgroups->dfl_cgrp;
+	else if (*iter == &psi_system)
+		return NULL;
+	else
+		cgroup = cgroup_parent(*iter);
+
+	if (cgroup && cgroup_parent(cgroup)) {
+		*iter = cgroup;
+		return cgroup_psi(cgroup);
+	}
+#else
+	if (*iter)
+		return NULL;
+#endif
+	*iter = &psi_system;
+	return &psi_system;
+}
+
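[ Aside: for a task in a nested cgroup /A/B on cgroup2, the walk above
  visits B's group, then A's, then psi_system; the root cgroup itself
  is skipped (its cgroup_parent() is NULL) because system-wide state is
  already tracked in psi_system. Callers loop over it, as the updated
  psi_task_change() below does:

	struct psi_group *group;
	void *iter = NULL;

	while ((group = iterate_groups(task, &iter)))
		psi_group_change(group, cpu, clear, set);
]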
 void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
+	struct psi_group *group;
+	void *iter = NULL;
 
 	if (!task->pid)
 		return;
@@ -485,17 +511,23 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	task->psi_flags &= ~clear;
 	task->psi_flags |= set;
 
-	psi_group_change(&psi_system, cpu, clear, set);
+	while ((group = iterate_groups(task, &iter)))
+		psi_group_change(group, cpu, clear, set);
 }
 
 void psi_memstall_tick(struct task_struct *task, int cpu)
 {
-	struct psi_group_cpu *groupc;
+	struct psi_group *group;
+	void *iter = NULL;
 
-	groupc = per_cpu_ptr(psi_system.pcpu, cpu);
-	write_seqcount_begin(&groupc->seq);
-	record_times(groupc, cpu, true);
-	write_seqcount_end(&groupc->seq);
+	while ((group = iterate_groups(task, &iter))) {
+		struct psi_group_cpu *groupc;
+
+		groupc = per_cpu_ptr(group->pcpu, cpu);
+		write_seqcount_begin(&groupc->seq);
+		record_times(groupc, cpu, true);
+		write_seqcount_end(&groupc->seq);
+	}
 }
 
 /**
@@ -558,8 +590,78 @@ void psi_memstall_leave(unsigned long *flags)
 	rq_unlock_irq(rq, &rf);
 }
 
-static int psi_show(struct seq_file *m, struct psi_group *group,
-		    enum psi_res res)
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgroup)
+{
+	if (psi_disabled)
+		return 0;
+
+	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	if (!cgroup->psi.pcpu)
+		return -ENOMEM;
+	group_init(&cgroup->psi);
+	return 0;
+}
+
+void psi_cgroup_free(struct cgroup *cgroup)
+{
+	if (psi_disabled)
+		return;
+
+	cancel_delayed_work_sync(&cgroup->psi.clock_work);
+	free_percpu(cgroup->psi.pcpu);
+}
+
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated stall
+ * state between the different groups.
+ *
+ * This function acquires the task's rq lock to lock out concurrent
+ * changes to the task's scheduling state and - in case the task is
+ * running - concurrent changes to its stall state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+	bool move_psi = !psi_disabled;
+	unsigned int task_flags = 0;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (move_psi) {
+		rq = task_rq_lock(task, &rf);
+
+		if (task_on_rq_queued(task))
+			task_flags = TSK_RUNNING;
+		else if (task->in_iowait)
+			task_flags = TSK_IOWAIT;
+
+		if (task->flags & PF_MEMSTALL)
+			task_flags |= TSK_MEMSTALL;
+
+		if (task_flags)
+			psi_task_change(task, task_flags, 0);
+	}
+
+	/*
+	 * Lame to do this here, but the scheduler cannot be locked
+	 * from the outside, so we move cgroups from inside sched/.
+	 */
+	rcu_assign_pointer(task->cgroups, to);
+
+	if (move_psi) {
+		if (task_flags)
+			psi_task_change(task, 0, task_flags);
+
+		task_rq_unlock(rq, task, &rf);
+	}
+}
+#endif /* CONFIG_CGROUPS */
+
+int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	int full;