From patchwork Thu Jun 8 11:49:28 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Linus Walleij
X-Patchwork-Id: 13272197
From: Linus Walleij
To: Andrew Morton, Jonathan Corbet
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, Linus Walleij,
 Matthew Wilcox, Randy Dunlap, Mike Rapoport
Subject: [PATCH v2] Documentation/mm: Initial page table documentation
Date: Thu, 8 Jun 2023 13:49:28 +0200
Message-Id: <20230608114928.3955640-1-linus.walleij@linaro.org>
X-Mailer: git-send-email 2.40.1

This is based on an earlier blog post at people.kernel.org,
it describes the concepts about page tables that were hardest
for me to grasp when dealing with them for the first time,
such as the prevalent three-letter acronyms pfn, pgd, p4d,
pud, pmd and pte.

I don't know if this is what people want, but it's what I would
have wanted. I discussed at one point with Mike Rapoport to
bring this into the kernel documentation, so here is a small
proposal.

Cc: Matthew Wilcox
Cc: Randy Dunlap
Cc: Mike Rapoport
Link: https://people.kernel.org/linusw/arm32-page-tables
Signed-off-by: Linus Walleij
Reviewed-by: Bagas Sanjaya
---
ChangeLog v1->v2:
- Fixed speling mistakes
- Copyedit the paragraph on page frame numbers.
- Reverse the arrows in the page table hierarchy
  illustration.
- Reverse the order of description of the page
  hierarchy levels.
- Create a new section for folding
- Emphasize that architectures should try to be
  page hierarchy neutral
- Trying to better describe the fact that the
  lowest page table PTE is called like that for
  historical reasons.
---
 Documentation/mm/page_tables.rst | 131 +++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)

diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 96939571d7bc..315d295d1740 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -3,3 +3,134 @@
 ===========
 Page Tables
 ===========
+
+Paged virtual memory was invented along with virtual memory as a concept in
+1962 on the Ferranti Atlas Computer, which was the first computer with paged
+virtual memory. The feature migrated to newer computers and became a de facto
+feature of all Unix-like systems as time went by. In 1985 the feature was
+included in the Intel 80386, which was the CPU Linux 1.0 was developed on.
+
+Page tables map virtual addresses as seen by the CPU program counter into
+physical addresses as seen on the external memory bus.
+
+Linux defines page tables as a hierarchy which is currently five levels in
+height. The target architecture code for each supported architecture will then
+map this to the restrictions of the target hardware.
+
+The physical address corresponding to the virtual address is often referenced
+by the underlying physical page frame. The **page frame number** or **pfn**
+is the physical address of the page (as seen on the external memory bus)
+divided by `PAGE_SIZE`.
+
+Physical memory address 0 will be *pfn 0* and the highest pfn will be
+the last page of physical memory the external address bus of the CPU can
+address.
+
+With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
+address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
+and so on until we reach pfn 0xfffff at 0xfffff000.
+
+As you can see, with 4KB pages the page base address uses bits 12-31 of the
+address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
+`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`.
+
+Over time a deeper hierarchy has been developed in response to increasing memory
+sizes. When Linux was created, 4KB pages and a single page table called
+`swapper_pg_dir` with 1024 entries were used, covering 4MB which coincided with
+the fact that Torvalds's first computer had 4MB of physical memory. Entries in
+this single table were referred to as *PTEs* - page table entries.
+
+Over time the page table hierarchy has developed into this::
+
+  +-----+
+  | PGD |
+  +-----+
+     |
+     |   +-----+
+     +-->| P4D |
+         +-----+
+            |
+            |   +-----+
+            +-->| PUD |
+                +-----+
+                   |
+                   |   +-----+
+                   +-->| PMD |
+                       +-----+
+                          |
+                          |   +-----+
+                          +-->| PTE |
+                              +-----+
+
+
+Symbols on the different levels of the page table hierarchy have the following
+meaning beginning from the bottom:
+
+- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
+  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
+  mapping a single page of virtual memory to a single page of physical memory.
+  The architecture defines the size and contents of `pteval_t`.
+
+  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
+  upper bits being a **pfn** (page frame number), and the lower bits being some
+  architecture-specific bits such as memory protection.
+
+  The **entry** part of the name is a bit confusing because while in Linux 1.0
+  this did refer to a single page table entry in the single top level page
+  table, it was retrofitted to be an array of mapping elements when two-level
+  page tables were first introduced, so the *pte* is the lowermost page
+  *table*, not a page table *entry*.
+
+- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
+  above the *pte*, with `PTRS_PER_PMD` references to *pte* tables.
+
+- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
+  the other levels to handle 4-level page tables. It is potentially unused,
+  or *folded* as we will discuss later.
+
+- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
+  handle 5-level page tables after the *pud* was introduced. Now it was clear
+  that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure indicating
+  the directory level and that we could not go on with ad hoc names any more.
+  This is only used on systems which actually have 5 levels of page tables,
+  otherwise it is folded.
+
+- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
+  main page table handling the PGD for the kernel memory is still found in
+  `swapper_pg_dir`, but each userspace process in the system also has its own
+  memory context and thus its own *pgd*, found in `struct mm_struct` which
+  in turn is referenced by each `struct task_struct`. So tasks have memory
+  context in the form of a `struct mm_struct` and this in turn has a
+  `pgd_t *pgd` pointer to the corresponding page global directory.
+
+To repeat: each level in the page table hierarchy is an *array of pointers*, so
+the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
+contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
+pointers on each level is architecture-defined::
+
+        PMD
+  --> +-----+           PTE
+      | ptr |-------> +-----+
+      | ptr |-        | ptr |-------> PAGE
+      | ptr | \       | ptr |
+      | ptr |  \        ...
+      | ... |   \
+      | ptr |    \         PTE
+      +-----+     +----> +-----+
+                         | ptr |-------> PAGE
+                         | ptr |
+                           ...
+
+
+Page Table Folding
+==================
+
+If the architecture does not use all the page table levels, they can be
+*folded*, which means skipped, and all operations performed on page tables
+will be compile-time augmented to just skip a level when accessing the next
+lower level.
+
+Page table handling code that wishes to be architecture-neutral, such as the
+virtual memory manager, will need to be written so that it traverses all of the
+currently five levels. This style should also be preferred for
+architecture-specific code, so as to be robust to future changes.
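
As a companion to the pfn arithmetic and the "array of pointers" description
in the patch above, here is a minimal user-space sketch in C. It is not kernel
code and not part of the patch. All `ex_`/`EX_` names are hypothetical helpers
made up for this illustration. It assumes 4KB pages (a page shift of 12, as in
the text) and, purely as an example layout, 9 index bits per level (512
pointers per table), which is what x86-64 uses with 5-level paging. Other
architectures choose different widths and may fold levels.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4KB pages, matching the example in the documentation text. */
#define EX_PAGE_SHIFT 12
#define EX_PAGE_SIZE  (1ULL << EX_PAGE_SHIFT)

/*
 * Assumption: every level indexes 512 pointers (9 bits), as on x86-64
 * with 5-level paging. Real values are architecture-defined.
 */
#define EX_BITS_PER_LEVEL 9
#define EX_PTRS_PER_TABLE (1U << EX_BITS_PER_LEVEL)

/* Levels ordered from the bottom of the hierarchy to the top. */
enum ex_level { EX_PTE = 0, EX_PMD, EX_PUD, EX_P4D, EX_PGD, EX_NR_LEVELS };

/* The pfn is the physical address divided by the page size. */
static inline uint64_t ex_phys_to_pfn(uint64_t phys)
{
	return phys >> EX_PAGE_SHIFT;
}

/* Base physical address of the page frame with the given pfn. */
static inline uint64_t ex_pfn_to_phys(uint64_t pfn)
{
	return pfn << EX_PAGE_SHIFT;
}

/* Byte offset of an address inside its page. */
static inline uint64_t ex_page_offset(uint64_t addr)
{
	return addr & (EX_PAGE_SIZE - 1);
}

/*
 * Index into the pointer array at the given level for a virtual address.
 * Each level consumes the next EX_BITS_PER_LEVEL bits above the page offset.
 */
static inline unsigned int ex_table_index(uint64_t vaddr, enum ex_level level)
{
	assert(level < EX_NR_LEVELS);
	unsigned int shift = EX_PAGE_SHIFT +
			     EX_BITS_PER_LEVEL * (unsigned int)level;
	return (unsigned int)((vaddr >> shift) & (EX_PTRS_PER_TABLE - 1));
}
```

With these assumptions, `ex_phys_to_pfn(0x00001000)` yields pfn 1, and a
virtual address is split into five table indices plus a 12-bit page offset. On
an architecture that folds a level, the corresponding index step would simply
be skipped at compile time.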