From patchwork Fri Aug 18 11:19:34 2023
X-Patchwork-Submitter: "Fabio M. De Francesco"
X-Patchwork-Id: 13357777
From: "Fabio M. De Francesco"
To: Jonathan Corbet, Jonathan Cameron, Linus Walleij, Mike Rapoport,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: "Fabio M. De Francesco", Andrew Morton, Ira Weiny, Matthew Wilcox,
 Randy Dunlap
Subject: [PATCH v3] Documentation/page_tables: Add info about MMU/TLB and Page Faults
Date: Fri, 18 Aug 2023 13:19:34 +0200
Message-ID: <20230818112726.6156-1-fmdefrancesco@gmail.com>
X-Mailer: git-send-email 2.41.0

Extend page_tables.rst by adding a section about the role of MMU and TLB
in translating between virtual addresses and physical page frames.
Furthermore explain the concept behind Page Faults and how the Linux
kernel handles TLB misses. Finally briefly explain how and why to disable
the page faults handler.

Cc: Andrew Morton
Cc: Ira Weiny
Cc: Jonathan Cameron
Cc: Jonathan Corbet
Cc: Linus Walleij
Cc: Matthew Wilcox
Cc: Mike Rapoport
Cc: Randy Dunlap
Reviewed-by: Linus Walleij
Signed-off-by: Fabio M. De Francesco
Acked-by: Mike Rapoport (IBM)
---

v2 -> v3: This version fixes the grammar mistakes found by Linus and
forwards his "Reviewed-by" tag (thanks!).
https://lore.kernel.org/all/CACRpkdbq8UCtvtRH7FZUEqvTxPQcoGbrKvf_mT5QHMAfVoYNNQ@mail.gmail.com/

v1 -> v2: This version takes into account the comments provided by Mike
(thanks!).
I hope I haven't overlooked anything he suggested :-)
https://lore.kernel.org/all/20230807105010.GK2607694@kernel.org/

Furthermore, v2 adds a little more information about swapping, which was
not present in v1.

Before the "real" patch, this was an RFC PATCH in its 2nd version for a
week or so until I received comments and suggestions from Jonathan Cameron
(thanks!), and then it morphed into a real patch. The thread with the RFC
PATCH v2 and the messages between Jonathan and me starts at
https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@gmail.com/#r

 Documentation/mm/page_tables.rst | 127 +++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)

diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 7840c1891751..be47b192a596 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -152,3 +152,130 @@ Page table handling code that wishes to be architecture-neutral, such as the
 virtual memory manager, will need to be written so that it traverses all of the
 currently five levels. This style should also be preferred for
 architecture-specific code, so as to be robust to future changes.
+
+
+MMU, TLB, and Page Faults
+=========================
+
+The `Memory Management Unit (MMU)` is a hardware component that handles virtual
+to physical address translations. It may use relatively small caches in hardware
+called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
+these translations.
+
+When the CPU accesses a memory location, it provides a virtual address to the
+MMU, which checks whether a translation exists in the TLB or in the Page Walk
+Caches (on architectures that support them). If no translation is found, the
+MMU walks the page tables to determine the physical address and create the map.
+
+Each page of memory has associated permission and dirty bits. The dirty bit is
+set (i.e., turned on) when the page is written to; it indicates that the page
+has been modified since it was loaded into memory.
+
+If nothing prevents it, eventually the physical memory can be accessed and the
+requested operation on the physical frame is performed.
+
+There are several reasons why the MMU can't find certain translations. It could
+happen because the CPU is trying to access memory that the current task is not
+permitted to access, or because the data is not present in physical memory.
+
+When these conditions happen, the MMU triggers page faults, which are types of
+exceptions that signal the CPU to pause the current execution and run a special
+function to handle the mentioned exceptions.
+
+There are common and expected causes of page faults. These are triggered by
+process management optimization techniques called "Lazy Allocation" and
+"Copy-on-Write". Page faults may also happen when frames have been swapped out
+to persistent storage (swap partition or file) and evicted from their physical
+locations.
+
+These techniques improve memory efficiency, reduce latency, and minimize space
+occupation. This document won't go deeper into the details of "Lazy Allocation"
+and "Copy-on-Write" because these subjects are out of scope as they belong to
+Process Address Management.
+
+Swapping differs from the other mentioned techniques because it's undesirable:
+it's performed as a means to reduce memory usage under heavy
+pressure.
+
+Swapping can't work for memory mapped by kernel logical addresses. These are a
+subset of the kernel virtual space that directly maps a contiguous range of
+physical memory. Given any logical address, its physical address is determined
+with simple arithmetic on an offset. Accesses to logical addresses are fast
+because they avoid the need for complex page table lookups, at the expense of
+frames not being evictable or pageable out.
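
The "simple arithmetic on an offset" used for kernel logical addresses can be
sketched in user-space C as shown here. This is only an illustration under
stated assumptions: `EXAMPLE_PAGE_OFFSET` and both helper names are
hypothetical, and the base value is made up for the example; the real offset
depends on the architecture and on the kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical base of the direct (linear) mapping. Assumption made for
 * illustration only; real kernels define this per architecture/config.
 */
#define EXAMPLE_PAGE_OFFSET UINT64_C(0xffff888000000000)

/* Logical (direct-map) address -> physical address: subtract the base. */
static inline uint64_t example_virt_to_phys(uint64_t vaddr)
{
    return vaddr - EXAMPLE_PAGE_OFFSET;
}

/* Physical address -> logical address: add the base back. */
static inline uint64_t example_phys_to_virt(uint64_t paddr)
{
    return paddr + EXAMPLE_PAGE_OFFSET;
}
```

Because the conversion is a constant add or subtract, no page table lookup is
needed in this model, which is also why such frames cannot be moved out from
under their addresses.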
+
+If the kernel fails to make room for the data that must be present in the
+physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
+by terminating lower-priority processes until pressure drops below a safe
+threshold.
+
+Additionally, page faults may also be caused by code bugs or by maliciously
+crafted addresses that the CPU is instructed to access. A thread of a process
+could use instructions to address (non-shared) memory which does not belong to
+its own address space, or could try to execute an instruction that wants to
+write to a read-only location.
+
+If the above-mentioned conditions happen in user space, the kernel sends a
+`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
+causes the termination of the thread and of the process it belongs to.
+
+This document is going to simplify and show a high-altitude view of how the
+Linux kernel handles these page faults, creates tables and tables' entries,
+checks whether memory is present and, if not, requests to load data from
+persistent storage or from other devices, and updates the MMU and its caches.
+
+The first steps are architecture-dependent. Most architectures jump to
+`do_page_fault()`, whereas the x86 interrupt handler is defined by the
+`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.
+
+Whatever the route, all architectures end up invoking
+`handle_mm_fault()` which, in turn, (likely) ends up calling
+`__handle_mm_fault()` to carry out the actual work of allocating the page
+tables.
+
+The unfortunate case of not being able to call `__handle_mm_fault()` means
+that the virtual address is pointing to areas of physical memory which are not
+permitted to be accessed (at least from the current context). This
+condition results in the kernel sending the above-mentioned SIGSEGV signal
+to the process and leads to the consequences already explained.
+
+`__handle_mm_fault()` carries out its work by calling several functions to
+find the entry offsets in the upper layers of the page tables and to allocate
+the tables that it may need.
+
+The functions that look for the offset have names like `*_offset()`, where the
+"*" is for pgd, p4d, pud, pmd, pte; the functions that allocate the
+corresponding tables, layer by layer, are instead called `*_alloc()`, using the
+above-mentioned convention to name them after the corresponding types of tables
+in the hierarchy.
+
+The page table walk may end at one of the middle or upper layers (PMD, PUD).
+
+Linux supports larger page sizes than the usual 4KB (i.e., the so-called
+`huge pages`). When using these kinds of larger pages, higher-level entries can
+map them directly, with no need to use lower-level page entries (PTE). Huge
+pages contain large contiguous physical regions that usually span from 2MB to
+1GB. They are mapped by the PMD and PUD page entries, respectively.
+
+Huge pages bring with them several benefits like reduced TLB pressure,
+reduced page table overhead, memory allocation efficiency, and performance
+improvement for certain workloads. However, these benefits come with
+trade-offs, like wasted memory and allocation challenges.
+
+At the very end of the walk with allocations, if it didn't return errors,
+`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
+performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`.
+"read", "cow", "shared" give hints about the reasons and the kind of fault it's
+handling.
+
+The actual implementation of the workflow is very complex. Its design allows
+Linux to handle page faults in a way that is tailored to the specific
+characteristics of each architecture, while still sharing a common overall
+structure.
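
The per-level offsets found by the `*_offset()` helpers, and the 2MB and 1GB
huge page sizes mentioned earlier in this section, both follow from how a
virtual address is split into table indices. The user-space sketch below
assumes x86-64-style paging with 4KB base pages and 9 index bits per level;
other architectures and configurations differ. All names (`ex_index()`,
`ex_entry_span()`, the `EX_*` constants) are hypothetical and are not kernel
APIs.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT   12  /* assumed 4KB base pages */
#define EX_BITS_PER_LVL 9   /* assumed 512 entries per table */

/* Levels counted upward from the lowest one (assumed 4-level layout). */
enum ex_level { EX_PTE = 0, EX_PMD = 1, EX_PUD = 2, EX_PGD = 3 };

/* Index of the entry that vaddr selects in the table at level lvl. */
static inline unsigned int ex_index(uint64_t vaddr, enum ex_level lvl)
{
    unsigned int shift = EX_PAGE_SHIFT + EX_BITS_PER_LVL * (unsigned int)lvl;

    return (unsigned int)((vaddr >> shift) & ((1u << EX_BITS_PER_LVL) - 1));
}

/* Bytes of virtual address space covered by one entry at level lvl. */
static inline uint64_t ex_entry_span(enum ex_level lvl)
{
    return UINT64_C(1) << (EX_PAGE_SHIFT + EX_BITS_PER_LVL * (unsigned int)lvl);
}
```

Under these assumptions one PMD entry spans 2^21 bytes (2MB) and one PUD entry
spans 2^30 bytes (1GB), which matches the huge page sizes a walk can stop at
without descending to the PTE level.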
+
+To conclude this high-altitude view of how Linux handles page faults, let's
+add that the page fault handler can be disabled and enabled respectively with
+`pagefault_disable()` and `pagefault_enable()`.
+
+Several code paths make use of the latter two functions because they need to
+disable traps into the page fault handler, mostly to prevent deadlocks.
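
The disable/enable pairing can be modelled in user space as a nesting counter.
This is a sketch under an assumption that the patch text does not state: that
the two calls nest, so the handler counts as disabled until every disable has
been matched by an enable. The `ex_task` structure and the `ex_*` functions
are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-task state; not the kernel's struct. */
struct ex_task {
    int pagefault_disabled; /* nesting depth of outstanding disables */
};

/* Model of disabling: one more level of nesting. */
static inline void ex_pagefault_disable(struct ex_task *t)
{
    t->pagefault_disabled++;
}

/* Model of enabling: undo exactly one earlier disable. */
static inline void ex_pagefault_enable(struct ex_task *t)
{
    t->pagefault_disabled--;
}

/* The handler is considered disabled while any disable is outstanding. */
static inline bool ex_pagefault_disabled(const struct ex_task *t)
{
    return t->pagefault_disabled != 0;
}
```

In this model a code path that must not trap into the handler brackets its
critical region with the two calls, and an inner pair does not re-enable the
handler for an outer one.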