From patchwork Tue Sep 12 18:45:06 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Kairui Song
X-Patchwork-Id: 13382044
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Yu Zhao, Roman Gushchin, Johannes Weiner,
 Michal Hocko, Hugh Dickins, Nhat Pham, Yuanchu Xie,
 Suren Baghdasaryan, "T . J . Mercier",
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [RFC PATCH v2 0/5] Refault distance checking for MGLRU
Date: Wed, 13 Sep 2023 02:45:06 +0800
Message-ID: <20230912184511.49333-1-ryncsn@gmail.com>
X-Mailer: git-send-email 2.41.0
Reply-To: Kairui Song
Sender: owner-linux-mm@kvack.org
Precedence: bulk

From: Kairui Song

Hi, linux-mm

I noticed that MGLRU does not work well on certain workloads under
heavy memory stress. After some debugging, I found this is related to
refault distance detection. The problem appears when the file page
workingset size exceeds total memory, and the access distance of file
pages (the left-shift time of a page before it gets activated or
promoted, considering the LRU starts from the right) is larger than
total memory. All file pages are then stuck on the oldest generation,
being read in and then evicted over and over; few get activated and
stay in memory.

This series tries to fix the problem by reworking the refault distance
based activation to better fit MGLRU, and also tries to use a unified
algorithm for both MGLRU and the Inactive/Active LRU. Performance
improves about 5x for the workload that was not working well
previously.

Patch 1/5 reworks the refault distance detection model for the
Inactive/Active LRU.
Patch 2/5 updates the comments.
Patch 3/5 and 4/5 are simplification and preparation.
Patch 5/5 applies the modified refault distance detection to MGLRU.
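As a rough illustration of the refault distance idea described above,
here is a toy userspace model in Python. It is only a sketch and not
the kernel code or the algorithm from this series: the FIFO cache, the
names simulate() and should_activate(), and the "limit" parameter are
all assumptions made for illustration. It shows that when a cyclic
workingset of W pages runs on a cache of C pages (W > C), every access
refaults, and the distance measured in evictions is W - C.

```python
from collections import OrderedDict


def simulate(capacity, accesses):
    """Run a toy FIFO cache; return the refault distance of each refault.

    The distance is the number of evictions that happened between the
    eviction of a page and its next access (hypothetical model).
    """
    resident = OrderedDict()  # page -> None, oldest entry first
    shadow = {}               # page -> eviction counter at eviction time
    evictions = 0
    distances = []
    for page in accesses:
        if page in resident:
            continue          # hit: nothing to track in this toy model
        if page in shadow:    # refault: the page was evicted earlier
            distances.append(evictions - shadow.pop(page))
        if len(resident) >= capacity:
            victim, _ = resident.popitem(last=False)  # evict the oldest
            shadow[victim] = evictions
            evictions += 1
        resident[page] = None
    return distances


def should_activate(distance, limit):
    """Activate a refaulting page only if its distance fits within limit.

    What 'limit' should be is a policy choice; here it is a parameter.
    """
    return distance <= limit


if __name__ == "__main__":
    # 6-page cyclic workingset on a 4-page cache: every refault is 2 away.
    print(simulate(4, list(range(6)) * 3))
    print(should_activate(2, 4), should_activate(5, 4))
```

In this model a page is worth activating when its distance is small
enough that keeping it resident would have avoided the refault; how
that limit is chosen is exactly what the patches in this series change.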
The following benchmark shows a 5x improvement:

To simulate the workload, I set up a 3-replica MongoDB cluster using
docker, each replica in a standalone cgroup, set to use 5GB of
WiredTiger cache and 10GB of oplog, on a 32G VM. The benchmark is done
using https://github.com/apavlo/py-tpcc.git, modified to run the
STOCK_LEVEL query only, to simulate a slow query and get a stable
result.

Before the patch (with 10G swap; the result doesn't change whether swap
is on or not):

$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     503             27150226136.4   0.02 txn/s
------------------------------------------------------------------
  TOTAL           503             27150226136.4   0.02 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 53391
workingset_refault_anon 0
workingset_refault_file 23856735
workingset_activate_anon 0
workingset_activate_file 23845737
workingset_restore_anon 0
workingset_restore_file 18280692
workingset_nodereclaim 1024

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        6752         379          23       24706       24607
Swap:         10239           0       10239

After the patch (with 10G swap on the same disk; similar result using
ZRAM):

$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 903 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2575            27094953498.8   0.10 txn/s
------------------------------------------------------------------
  TOTAL           2575            27094953498.8   0.10 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 78249
workingset_refault_anon 10139
workingset_refault_file 23001863
workingset_activate_anon 7238
workingset_activate_file 6718032
workingset_restore_anon 7432
workingset_restore_file 6719406
workingset_nodereclaim 9747

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        7376         320           3       24140       24014
Swap:         10239        1662        8577

The performance is 5x better than before, and the idle anon pages can
now get swapped out as expected.

Testing with lower stress also shows an improvement.

I also checked the benchmark with memtier/memcached, fio and some
other benchmarks; they look OK so far. The results are in each commit.

Sending this out as an RFC. I'm still doing more testing, since this
changes a frequently used algorithm and I'm not sure whether there is
any performance regression. It should improve performance for file
pages in general, since it saves some operations.

Update from V1:
- Removed the fls operations which were previously used in patch 1 for
  protecting active pages by an exponential ratio; simply comparing
  with the number of inactive pages seems good enough.
- Updated some benchmark results; test results that are basically
  identical to before are not updated.

Kairui Song (5):
  workingset: simplify and use a more intuitive model
  workingset: update comment in workingset.c
  workingset: simplify lru_gen_test_recent
  lru_gen: convert avg_total and avg_refaulted to atomic
  workingset, lru_gen: apply refault-distance based re-activation

 include/linux/mmzone.h |   4 +-
 include/linux/swap.h   |   2 -
 mm/swap.c              |   1 -
 mm/vmscan.c            |  18 +-
 mm/workingset.c        | 411 +++++++++++++++++++++--------------------
 5 files changed, 221 insertions(+), 215 deletions(-)
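
The headline figures in the benchmark above can be cross-checked with a
few lines of arithmetic. The sketch below only reuses numbers copied
from the tpcc.py and /proc/vmstat output quoted in this letter; the
variable names and the "activation ratio" label are illustrative and
not part of the series.

```python
# Values copied from the tpcc.py output above.
executed_before, executed_after = 503, 2575

# Values copied from the /proc/vmstat output above.
refault_file_before, activate_file_before = 23856735, 23845737
refault_file_after, activate_file_after = 23001863, 6718032

# Throughput gain: transactions completed in the same ~900 s window.
speedup = executed_after / executed_before

# Share of file refaults that were also counted as activations.
activation_ratio_before = activate_file_before / refault_file_before
activation_ratio_after = activate_file_after / refault_file_after

if __name__ == "__main__":
    print(f"speedup: {speedup:.2f}x")
    print(f"activation ratio before: {activation_ratio_before:.2%}")
    print(f"activation ratio after:  {activation_ratio_after:.2%}")
```

The executed-transaction count rises by a factor of slightly more than
five, which matches the 5x claim, while the share of file refaults
counted as activations drops from nearly all of them to under a third.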