From patchwork Sun Oct 27 12:07:45 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lance Yang
X-Patchwork-Id: 13852508
From: Lance Yang
To: akpm@linux-foundation.org
Cc: dj456119@gmail.com, cunhuang@tencent.com, leonylgao@tencent.com,
 j.granados@samsung.com, jsiddle@redhat.com, kent.overstreet@linux.dev,
 21cnbao@gmail.com, ryan.roberts@arm.com, david@redhat.com, ziy@nvidia.com,
 libang.li@antgroup.com, baolin.wang@linux.alibaba.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, joel.granados@kernel.org,
 linux@weissschuh.net, Lance Yang
Subject: [PATCH v2 0/2] add detect count for hung tasks
Date: Sun, 27 Oct 2024 20:07:45 +0800
Message-ID: <20241027120747.42833-1-ioworker0@gmail.com>
X-Mailer: git-send-email 2.45.2

Hi all,

This patchset adds a counter, hung_task_detect_count, to track the number
of times hung tasks are detected.

IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg. However, this method isn't as user-friendly as
using a counter.

Sometimes, a short-lived issue with a NIC or hard drive can quickly
decrease hung_task_warnings to zero.
Without warnings, we must directly access the node to ensure that there
are no more hung tasks and that the system has recovered. After all, load
average alone cannot provide a clear picture.

Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower number to improve stability,
even though this might result in false positives. And then we can set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes. Based on past experience,
this approach could help avoid many production disruptions.

Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense ;)

---
Changes since v1 [1]
====================
 - hung_task: add detect count for hung tasks
   - Update the changelog (per Andrew)
   - Find other folks to CC (per Andrew)

[1] https://lore.kernel.org/linux-mm/20241022114736.83285-1-ioworker0@gmail.com

Lance Yang (2):
  hung_task: add detect count for hung tasks
  hung_task: add docs for hung_task_detect_count

 Documentation/admin-guide/sysctl/kernel.rst |  9 +++++++++
 kernel/hung_task.c                          | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)
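
For illustration only, the time-based threshold described above could be
sketched in userspace roughly as follows. This is not part of the series.
It assumes the counter is exposed at /proc/sys/kernel/hung_task_detect_count
(inferred from the sysctl documentation patch) and that it keeps increasing
while tasks stay hung. HungTaskWatcher, monitor() and the 300s/240s/30s
values are hypothetical names and numbers chosen for the example.

```python
import time
from pathlib import Path
from typing import Optional

# Assumed location of the new sysctl; adjust if the final path differs.
COUNTER_PATH = Path("/proc/sys/kernel/hung_task_detect_count")


def read_detect_count(path: Path = COUNTER_PATH) -> int:
    """Return the current value of the hung task counter."""
    return int(path.read_text().strip())


class HungTaskWatcher:
    """Track for how long the counter has kept increasing (hypothetical helper)."""

    def __init__(self, threshold_secs: float, quiet_secs: float) -> None:
        self.threshold_secs = threshold_secs  # how long hung tasks may persist
        self.quiet_secs = quiet_secs          # no increase for this long = recovered
        self.last_count: Optional[int] = None
        self.episode_start: Optional[float] = None
        self.last_increase: Optional[float] = None

    def observe(self, count: int, now: float) -> bool:
        """Feed one sample; return True once detections span the threshold."""
        if self.last_count is not None and count > self.last_count:
            # New detections since the previous sample: start or extend an episode.
            if self.episode_start is None:
                self.episode_start = now
            self.last_increase = now
        elif (self.last_increase is not None
              and now - self.last_increase >= self.quiet_secs):
            # No new detections for a while: treat the node as recovered.
            self.episode_start = None
            self.last_increase = None
        self.last_count = count
        if self.episode_start is None or self.last_increase is None:
            return False
        return self.last_increase - self.episode_start >= self.threshold_secs


def monitor(poll_secs: float = 30.0) -> None:
    """Poll the counter forever and report when the threshold is crossed."""
    watcher = HungTaskWatcher(threshold_secs=300.0, quiet_secs=240.0)
    while True:
        if watcher.observe(read_detect_count(), time.monotonic()):
            # Placeholder for the real action, e.g. migrating containers.
            print("hung tasks persisted past the threshold")
        time.sleep(poll_secs)
```

Since the watcher only compares counter samples, it keeps working after
hung_task_warnings has dropped to zero and nothing more shows up in dmesg.
quiet_secs should probably be longer than the hung task check interval so
that a single missed increase does not reset the episode.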