From patchwork Tue Jan 2 18:46:22 2024
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
To: linux-mm@kvack.org, Andrew Morton
Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
 Matthew Wilcox, "Liam R. Howlett", Dave Chinner, "Paul E. McKenney",
 Joel Fernandes, Uladzislau Rezki, Oleksiy Avramchenko
Subject: [PATCH v3 00/11] Mitigate a vmap lock contention v3
Date: Tue, 2 Jan 2024 19:46:22 +0100
Message-Id: <20240102184633.748113-1-urezki@gmail.com>
This is v3. It is based on 6.7.0-rc8.

1. Motivation

- Offload the global vmap locks, making them scale with the number of
  CPUs;
- If possible and there is an agreement, we can remove the "per-cpu kva
  allocator" to make the vmap code simpler;
- There were complaints from XFS folks that vmalloc might be contended
  on their workloads.

2. Design (high level overview)

We introduce an effective vmap node logic. A node behaves as an
independent entity that serves an allocation request directly (if
possible) from its pool. That way it bypasses the global vmap space,
which is protected by its own lock. Access to the pools is serialized
per CPU. The number of nodes equals the number of CPUs in a system;
please note it is bound by a high threshold of 128 nodes.

Pools are size-segregated and populated based on system demand. The
maximum allocation request that can be stored in segregated storage is
256 pages. The lazy drain path decays a pool by 25% as a first step,
and as a second step populates it with freshly freed VAs for reuse,
instead of returning them to the global space.

When a VA is obtained (alloc path), it is stored in one of the separate
nodes. Its va->va_start address is converted to the node where it
should be placed and reside. Doing so balances VAs across the nodes,
and as a result access becomes scalable. The addr_to_node() function
performs the proper address-to-node conversion.

The vmap space is divided into segments of fixed size, 16 pages each.
That way any address can be associated with a segment number. The
number of segments equals num_possible_cpus(), but is not greater than
128. Numbering starts from 0. See below how an address is converted:

    static inline unsigned int
    addr_to_node_id(unsigned long addr)
    {
            return (addr / zone_size) % nr_nodes;
    }

On the free path, the node a VA resides in can easily be found by
converting its "va_start" address.
It is moved from the "busy" into the "lazy" data structure. Later on,
as noted earlier, the lazy kworker decays each node pool and populates
it with fresh incoming VAs. Please note, a VA is returned to the node
that did the alloc request.

3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor

sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

    94.41%  0.89%  [kernel]        [k] _raw_spin_lock
    93.35% 93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
    76.13%  0.28%  [kernel]        [k] __vmalloc_node_range
    72.96%  0.81%  [kernel]        [k] alloc_vmap_area
    56.94%  0.00%  [kernel]        [k] __get_vm_area_node
    41.95%  0.00%  [kernel]        [k] vmalloc
    37.15%  0.01%  [test_vmalloc]  [k] full_fit_alloc_test
    35.17%  0.00%  [kernel]        [k] ret_from_fork_asm
    35.17%  0.00%  [kernel]        [k] ret_from_fork
    35.17%  0.00%  [kernel]        [k] kthread
    35.08%  0.00%  [test_vmalloc]  [k] test_func
    34.45%  0.00%  [test_vmalloc]  [k] fix_size_alloc_test
    28.09%  0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
    23.53%  0.25%  [kernel]        [k] vfree.part.0
    21.72%  0.00%  [kernel]        [k] remove_vm_area
    20.08%  0.21%  [kernel]        [k] find_unlink_vmap_area
     2.34%  0.61%  [kernel]        [k] free_vmap_area_noflush

vs

    82.32%  0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
    63.36%  0.02%  [kernel]        [k] vmalloc
    63.34%  2.64%  [kernel]        [k] __vmalloc_node_range
    30.42%  4.46%  [kernel]        [k] vfree.part.0
    28.98%  2.51%  [kernel]        [k] __alloc_pages_bulk
    27.28%  0.19%  [kernel]        [k] __get_vm_area_node
    26.13%  1.50%  [kernel]        [k] alloc_vmap_area
    21.72% 21.67%  [kernel]        [k] clear_page_rep
    19.51%  2.43%  [kernel]        [k] _raw_spin_lock
    16.61% 16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
    13.40%  2.07%  [kernel]        [k] free_unref_page
    10.62%  0.01%  [kernel]        [k] remove_vm_area
     9.02%  8.73%  [kernel]        [k] insert_vmap_area
     8.94%  0.00%  [kernel]        [k] ret_from_fork_asm
     8.94%  0.00%  [kernel]        [k] ret_from_fork
     8.94%  0.00%  [kernel]        [k] kthread
     8.29%  0.00%  [test_vmalloc]  [k] test_func
     7.81%  0.05%  [test_vmalloc]  [k] full_fit_alloc_test
     5.30%  4.73%  [kernel]        [k] purge_vmap_node
     4.47%  2.65%  [kernel]        [k] free_vmap_area_noflush

This confirms that
native_queued_spin_lock_slowpath goes down to 16.51% from 93.07%. The
throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$

4. Changelog

v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@gmail.com/

Delta v2 -> v3:
- fix comments from v2 feedback;
- switch from the pre-fetch chunk logic to less complex size-based
  pools.

Baoquan He (1):
  mm/vmalloc: remove vmap_area_list

Uladzislau Rezki (Sony) (10):
  mm: vmalloc: Add va_alloc() helper
  mm: vmalloc: Rename adjust_va_to_fit_type() function
  mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: Remove global vmap_area_root rb-tree
  mm: vmalloc: Remove global purge_vmap_area_root rb-tree
  mm: vmalloc: Offload free_vmap_area_lock lock
  mm: vmalloc: Support multiple nodes in vread_iter
  mm: vmalloc: Support multiple nodes in vmallocinfo
  mm: vmalloc: Set nr_nodes based on CPUs in a system
  mm: vmalloc: Add a shrinker to drain vmap pools

 .../admin-guide/kdump/vmcoreinfo.rst |    8 +-
 arch/arm64/kernel/crash_core.c       |    1 -
 arch/riscv/kernel/crash_core.c       |    1 -
 include/linux/vmalloc.h              |    1 -
 kernel/crash_core.c                  |    4 +-
 kernel/kallsyms_selftest.c           |    1 -
 mm/nommu.c                           |    2 -
 mm/vmalloc.c                         | 1049 ++++++++++++-----
 8 files changed, 786 insertions(+), 281 deletions(-)