Message ID: 20200611134914.765827-1-gregory.clement@bootlin.com (mailing list archive)
Series: ARM: Add support for large kernel page (from 8K to 64K)
Hi Gregory,

You're on your own with this one; I've no motivation to re-understand the ARM page table code now that 32-bit ARM is basically unsupported. I'll point out some of the things you got wrong below, though.

On Thu, Jun 11, 2020 at 03:49:08PM +0200, Gregory CLEMENT wrote:
> Hello,
>
> On ARM based NAS it is possible to have storage volume larger than
> 16TB, especially with the use of LVM. However, on 32-bit architectures,
> the page cache index is stored on 32 bits, which means that given a
> page size of 4 KB, we can only address volumes of up to 16 TB.
>
> Therefore, one option to use such large volumes and filesystems on 32
> bits architecture is to increase the page size.
>
> This series allows to support 8K, 16K, 32K and 64K kernel pages. On
> ARM the size of the page can be either 4K or 64K, so for the other
> size a "software emulation" is used, here Linux thinks it is using
> pages of 8 KB, 16 KB or 32 KB, while underneath the MMU still uses 4
> KB pages.
>
> For ARM there is already a difference between the kernel page and the
> hardware page in the way they are managed. In the same 4K space the
> Linux kernel deals with 2 PTE tables at the beginning, while the
> hardware deals with 2 other hardware PTE tables.

This is incorrect. The kernel page size and the hardware page size match today - both are 4k. What you're talking about here is the PTE table size.

The kernel requires that each PTE table is contained within one struct page. Since one hardware PTE table is 256 entries, it occupies 1024 bytes, so a quarter of a page. So, to have a single 4k page per PTE table would waste quite a bit of space.

Now, the hardware PTE tables do not lend themselves to the kernel's usage: the kernel wants additional bits to track the state of each page in the page tables. Hence, we need to shadow every PTE entry. This also provides us independence of the underlying hardware PTE entry format, which varies between ARM architecture versions.

So, we end up with a single 4k page containing two consecutive hardware PTE tables, followed by two Linux PTE tables for the kernel's benefit.

If you increase the page size, then you need to increase the number of tables in a page, or suffer a huge amount of wasted memory taken for the page tables - going to an 8k page size means that the upper 4k of each page will not be used. Going to 16k means the upper 12k won't be used. And so on - as your software page size increases, the amount of memory wasted for each PTE table will increase unless you also increase the number of hardware 1st level entries pointing to each PTE page. With 64k pages, 60k of each PTE page will remain unused. That isn't very efficient use of memory.
Hello Russell,

> Hi Gregory,
>
> You're on your own with this one; I've no motivation to re-understand
> the ARM page table code now that 32-bit ARM is basically unsupported.

Understood.

> I'll point out some of the things you got wrong below though.

However, thanks for your pointers.

> On Thu, Jun 11, 2020 at 03:49:08PM +0200, Gregory CLEMENT wrote:
>> [...]
>> For ARM there is already a difference between the kernel page and the
>> hardware page in the way they are managed. In the same 4K space the
>> Linux kernel deals with 2 PTE tables at the beginning, while the
>> hardware deals with 2 other hardware PTE tables.
>
> This is incorrect. The kernel page size and the hardware page size
> match today - both are 4k. What you're talking about here is the
> PTE table size.
>
> The kernel requires that each PTE table is contained within one
> struct page. Since one hardware PTE table is 256 entries, it
> occupies 1024 bytes, so a quarter of a page. So, to have a single
> 4k page per PTE table would waste quite a bit of space.
>
> Now, the hardware PTE tables do not lend themselves to the kernel's
> usage: the kernel wants additional bits to track the state of each
> page in the page tables. Hence, we need to shadow every PTE entry.
> This also provides us independence of the underlying hardware PTE
> entry format, which varies between ARM architecture versions.
>
> So, we end up with a single 4k page containing two consecutive
> hardware PTE tables, followed by two Linux PTE tables for the
> kernel's benefit.

That was what I understood, but it seems I didn't formulate it accurately.

> If you increase the page size, then you need to increase the number
> of tables in a page, or suffer a huge amount of wasted memory taken
> for the page tables - going to an 8k page size means that the upper
> 4k of each page will not be used. Going to 16k means the upper 12k
> won't be used. And so on - as your software page size increases,
> the amount of memory wasted for each PTE table will increase
> unless you also increase the number of hardware 1st level entries
> pointing to each PTE page. With 64k pages, 60k of each PTE page
> will remain unused.

Unfortunately, I was aware of this, but I thought it was an acceptable drawback to be able to address large volumes on a 32-bit ARM system. Actually, it is already done on some products.

> That isn't very efficient use of memory.

Indeed. However, on a 3 GB system, in the worst case we need 786432 4K pages to map the memory. These pages can be mapped by 1536 blocks of 512 entries. So when 64K pages are emulated we lose 92 MB (around 3% of the memory). That is not negligible, but given the use case it seems acceptable. Of course, that doesn't prevent us from trying to do better.

Gregory
On Thu, Jun 11, 2020 at 6:21 PM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote:

> If you increase the page size, then you need to increase the number
> of tables in a page, or suffer a huge amount of wasted memory taken
> for the page tables - going to an 8k page size means that the upper
> 4k of each page will not be used. Going to 16k means the upper 12k
> won't be used. And so on - as your software page size increases,
> the amount of memory wasted for each PTE table will increase
> unless you also increase the number of hardware 1st level entries
> pointing to each PTE page. With 64k pages, 60k of each PTE page
> will remain unused.
>
> That isn't very efficient use of memory.

I think this could be addressed by using the full page to contain PTEs by making PTRS_PER_PTE larger and PTRS_PER_PGD smaller, but there is an even bigger problem in the added memory usage and I/O overhead for basically everything else: in any sparsely populated memory-mapped file or anonymous mapping, the memory usage grows with the page size as well.

I think Synology's vendor kernels for their NAS boxes have a different hack to make large file systems work, by extending the internal data types (I forgot which ones) to 64 bit. That is probably more invasive to the generic kernel code, but should be much more efficient and less invasive to ARM architecture specific code.

Either way, I wonder what the intended use cases are. Is this work mainly intended for a) running Debian/Buildroot/Yocto/... with (close to) upstream kernels on older NAS boxes, b) commercial products that use 32-bit SoCs in multi-disk NAS boxes with vendor upgrades to future kernels, or c) commercial products using 64-bit SoCs but 32-bit kernels? My feeling is that any commercial products that need this are either stuck on old kernels already, or they have moved on to 64-bit chips and are better off running a 64-bit kernel[1], so a) seems like the main purpose, right?

Arnd
On Fri, Jun 12, 2020 at 11:23:11AM +0200, Arnd Bergmann wrote:
> On Thu, Jun 11, 2020 at 6:21 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
>> [...]
>> That isn't very efficient use of memory.
>
> I think this could be addressed by using the full page to contain
> PTEs by making PTRS_PER_PTE larger and PTRS_PER_PGD
> smaller, but there is an even bigger problem in the added memory
> usage and I/O overhead for basically everything else: in any
> sparsely populated memory mapped file or anonymous mapping,
> the memory usage grows with the page size as well.
>
> I think Synology's vendor kernels for their NAS boxes have a
> different hack to make large file systems work, by extending
> the internal data types (I forgot which ones) to 64 bit. That is
> probably more invasive to the generic kernel code, but should
> be much more efficient and less invasive to ARM architecture
> specific code.

IIUC from Gregory's cover letter, the problem is page->index, which is a pgoff_t, an unsigned long. This limits us to 32-bit page offsets, so a 44-bit actual file offset (16TB). It may be worth exploring this rather than hacking the page tables to pretend we have bigger page sizes.
On Fri, Jun 12, 2020 at 2:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Jun 12, 2020 at 11:23:11AM +0200, Arnd Bergmann wrote:
>> [...]
>> I think Synology's vendor kernels for their NAS boxes have a
>> different hack to make large file systems work, by extending
>> the internal data types (I forgot which ones) to 64 bit. That is
>> probably more invasive to the generic kernel code, but should
>> be much more efficient and less invasive to ARM architecture
>> specific code.
>
> IIUC from Gregory's cover letter, the problem is page->index which is a
> pgoff_t, unsigned long. This limits us to 32-bit page offsets, so a
> 44-bit actual file offset (16TB). It may be worth exploring this rather
> than hacking the page tables to pretend we have bigger page sizes.

Right, that's at least one type that needs to be changed; there may be additional ones besides it. In Synology's patch there is also a new rdx_t type that is defined the same way and used elsewhere, at least for some SoCs (they use a maze of #ifdefs to merge all the vendor kernels, and they also strip all code comments and git history from the tarballs).

https://pastebin.com/e8C1zhzG has an attempt to split out the relevant changes from the linux-3.10.105 tarball that they use on Armada 385; see https://sourceforge.net/projects/dsgpl/files/Synology%20NAS%20GPL%20Source/24922branch/armada38x-source/linux-3.10.x-bsp.txz/download for the full kernel sources.

Arnd