From patchwork Mon Mar 3 16:29:53 2025
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 13999173
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org, cgroups@vger.kernel.org,
 linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-api@vger.kernel.org, David Hildenbrand, Andrew Morton,
 "Matthew Wilcox (Oracle)", Tejun Heo, Zefan Li, Johannes Weiner,
 Michal Koutný, Jonathan Corbet, Andy Lutomirski, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, Muchun Song,
 "Liam R. Howlett", Lorenzo Stoakes, Vlastimil Babka, Jann Horn
Subject: [PATCH v3 00/20] mm: MM owner tracking for large folios
 (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT
Date: Mon, 3 Mar 2025 17:29:53 +0100
Message-ID: <20250303163014.1128035-1-david@redhat.com>
X-Mailer: git-send-email 2.48.1

Some smaller changes based on Zi Yan's feedback (thanks!).

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
is "maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to
introduce a kernel config option that lets us not use+maintain
per-page mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or
"mapped exclusively" into a single MM.

1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two
                "unsigned long" available for our purposes

Patch #7 -> #11: preparations

Patch #12: MM owner tracking for large folios

Patch #13: COW reuse for PTE-mapped anon THP

Patch #14: folio_maybe_mapped_shared()

Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information into our folios. On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.

For each large folio, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.
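
To make the slot idea more concrete, here is a minimal userspace sketch
for the 64bit case. It is not the code from this series: the struct
layout, MM_ID_NONE and all sketch_* names are made up for illustration,
the mapcounts start at 0 for simplicity, and locking is left out
entirely. It only shows how two slots can be compared against the
folio's total ("large") mapcount to tell "certainly mapped exclusively"
from "maybe mapped shared", including the case where an untracked MM
makes the answer imprecise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MM_ID_NONE 0u	/* hypothetical marker for "slot is free" */

/* Hypothetical 64bit layout: two MM IDs plus two per-MM mapcounts. */
struct mm_slots_sketch {
	uint32_t id[2];		/* 31-bit MM IDs stored in 32-bit fields */
	int32_t mapcount[2];	/* mappings owned by the MM in that slot */
};

/* Both slots together occupy two 64-bit words. */
_Static_assert(sizeof(struct mm_slots_sketch) == 2 * sizeof(uint64_t),
	       "slots must fit into two 64-bit words");

struct folio_sketch {
	struct mm_slots_sketch slots;
	int32_t large_mapcount;	/* all mappings of the folio, any MM */
};

/* Account one new mapping of the folio by the MM with the given ID. */
static void sketch_map(struct folio_sketch *f, uint32_t mm_id)
{
	int i;

	f->large_mapcount++;
	/* The MM already owns a slot: just bump its mapcount. */
	for (i = 0; i < 2; i++) {
		if (f->slots.id[i] == mm_id) {
			f->slots.mapcount[i]++;
			return;
		}
	}
	/* Otherwise, try to grab a free slot. */
	for (i = 0; i < 2; i++) {
		if (f->slots.id[i] == MM_ID_NONE) {
			f->slots.id[i] = mm_id;
			f->slots.mapcount[i] = 1;
			return;
		}
	}
	/* No free slot: "untracked MM", only the large mapcount sees it. */
}

/* Account one removed mapping of the folio by the MM with the given ID. */
static void sketch_unmap(struct folio_sketch *f, uint32_t mm_id)
{
	int i;

	f->large_mapcount--;
	for (i = 0; i < 2; i++) {
		if (f->slots.id[i] == mm_id) {
			/* Free the slot once the MM's last mapping is gone. */
			if (--f->slots.mapcount[i] == 0)
				f->slots.id[i] = MM_ID_NONE;
			return;
		}
	}
	/* Untracked MM: nothing else to update. */
}

/* false: certainly mapped exclusively; true: maybe mapped shared. */
static bool sketch_maybe_mapped_shared(const struct folio_sketch *f)
{
	int i;

	/* A single remaining mapping is always exclusive. */
	if (f->large_mapcount <= 1)
		return false;
	/* A tracked MM that owns all mappings is the exclusive owner. */
	for (i = 0; i < 2; i++) {
		if (f->slots.id[i] != MM_ID_NONE &&
		    f->slots.mapcount[i] == f->large_mapcount)
			return false;
	}
	return true;
}
```

With this sketch, a folio mapped twice by an untracked MM and by nobody
else is still reported as "maybe mapped shared", which is exactly the
kind of imprecision the two-slot approach accepts.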

Paired with the large mapcount, we can reliably identify whether one
of these MMs is the current owner (-> owns all mappings) or even holds
all folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or
"mapped exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM". If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as
"mapped exclusively" but instead as "maybe mapped shared" (exception:
only a single mapping remains).

So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure we keep the machinery fast, to not degrade (un)map
performance drastically: for example, we make sure to only use a single
atomic (when grabbing the bit-spinlock), like we would already perform
when updating the large mapcount.

3 CONFIG_NO_PAGE_MAPCOUNT
=========================

Patches #15 -> #20 spell out and document what exactly is affected when
not maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped. In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate
the USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point,
to then slowly make it the only option, as we learn about real-life
impacts and possible ways to mitigate them.

4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, which
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems. Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folio sizes.

* CoW reuse performance of large folios now matches CoW reuse
  performance of small folios, because we now actually implement the CoW
  reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65%
  reduction in runtime, on an arm64 system I measured a ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and
  up to ~70% on an AmpereOne A192-32X) with larger folios. The larger
  the folios, the larger the performance improvement.

* munmap() performance very slightly (couple percent) degrades without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and
  up to ~10% on an AmpereOne A192-32X) with larger folios. The larger
  the folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for smaller
  folios on the AmpereOne A192-32X. I did not investigate the details
  yet, but I suspect code layout changes or suboptimal code
  placement / inlining.

I'm not too worried about the fork() micro-benchmarks for smaller folios
given how shaky the results are lately and by how much we improved
fork() performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq, which are part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock. My measurements on a 2-socket 10-core Intel Xeon Silver
4210R system revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1%
stdev, which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really
hammer on concurrent mapping/unmapping of large folio pages like
case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem
to affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the
locking, or simply stop the MM tracking once there are "too many
mappings" and assume that the folio is "mapped shared" until it is
freed.

This would be similar (but slightly different) to the "0,1,2,stopped"
counting idea Willy had at some point. Adding that logic to
"stop tracking" adds more code to the hot path, so I avoided that for
now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively. If we run into surprises,
I have some ideas on how to resolve them. For now, I think we should be
fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not
really matter much in the bigger picture. I'd like to further reduce it
(and possibly improve fork() performance further), but I don't easily
see how right now. Well, and I am out of puff