
[v2,0/6] ARM: Add support for large kernel page (from 8K to 64K)

Message ID 20200611134914.765827-1-gregory.clement@bootlin.com (mailing list archive)

Message

Gregory CLEMENT June 11, 2020, 1:49 p.m. UTC
Hello,

On ARM-based NAS devices it is possible to have storage volumes larger
than 16 TB, especially with the use of LVM. However, on 32-bit
architectures, the page cache index is stored on 32 bits, which means
that with a page size of 4 KB we can only address volumes of up to
16 TB.

Therefore, one option to use such large volumes and filesystems on a
32-bit architecture is to increase the page size.

This series adds support for 8K, 16K, 32K and 64K kernel pages. On
ARM the hardware page size can be either 4K or 64K, so for the other
sizes a "software emulation" is used: Linux thinks it is using pages
of 8 KB, 16 KB or 32 KB, while underneath the MMU still uses 4 KB
pages.

For ARM there is already a difference between the kernel page and the
hardware page in the way they are managed. Within the same 4K page,
the Linux kernel deals with two PTE tables at the beginning, while the
hardware deals with two other, hardware PTE tables.

This series takes advantage of this, and pushes the difference between
the hardware and Linux versions further by using a larger page size at
the Linux kernel level.

This series is inspired by fa0ca2726ea9 ("DSMP 64K support") and
4ef803e12baf ("mmu: large-page: Added support for multiple kernel page
sizes") from
https://github.com/MarvellEmbeddedProcessors/linux-marvell.git. This
feature has been used intensively for many years on real products.

The first 4 patches are preparation, making the distinction between
the kernel page size and the hardware page size. For a 4K kernel page
they don't modify anything.

The fifth patch is the one actually adding support for large kernel
pages. This feature is restricted to ARMv7 and non-LPAE
configurations. It could perhaps be extended to support others, but
until now it has only been tested on ARMv7.

The last patch allows using the hardware 64K large pages.

Gregory

Gregory CLEMENT (6):
  ARM: Use PAGE_SIZE for ELF_EXEC_PAGESIZE
  ARM: pagetable: prepare hardware page table to use large page
  ARM: Make the number of fix bitmap depend on the page size
  ARM: mm: Aligned pte allocation to one page
  ARM: Add large kernel page support
  ARM: Add 64K page support at MMU level

 arch/arm/include/asm/elf.h                  |  2 +-
 arch/arm/include/asm/fixmap.h               |  3 +-
 arch/arm/include/asm/page.h                 | 12 ++++
 arch/arm/include/asm/pgtable-2level-hwdef.h |  8 +++
 arch/arm/include/asm/pgtable-2level.h       |  6 +-
 arch/arm/include/asm/pgtable.h              |  4 ++
 arch/arm/include/asm/shmparam.h             |  4 ++
 arch/arm/include/asm/tlbflush.h             | 21 +++++-
 arch/arm/kernel/entry-common.S              | 13 ++++
 arch/arm/kernel/traps.c                     | 10 +++
 arch/arm/mm/Kconfig                         | 72 +++++++++++++++++++++
 arch/arm/mm/fault.c                         | 19 ++++++
 arch/arm/mm/mmu.c                           | 22 ++++++-
 arch/arm/mm/pgd.c                           |  2 +
 arch/arm/mm/proc-v7-2level.S                | 72 ++++++++++++++++++++-
 arch/arm/mm/tlb-v7.S                        | 14 +++-
 16 files changed, 271 insertions(+), 13 deletions(-)

Comments

Russell King (Oracle) June 11, 2020, 4:21 p.m. UTC | #1
Hi Gregory,

You're on your own with this one; I've no motivation to re-understand
the ARM page table code now that 32-bit ARM is basically unsupported.

I'll point out some of the things you got wrong below though.

On Thu, Jun 11, 2020 at 03:49:08PM +0200, Gregory CLEMENT wrote:
> Hello,
> 
> On ARM-based NAS devices it is possible to have storage volumes larger
> than 16 TB, especially with the use of LVM. However, on 32-bit
> architectures, the page cache index is stored on 32 bits, which means
> that with a page size of 4 KB we can only address volumes of up to
> 16 TB.
> 
> Therefore, one option to use such large volumes and filesystems on a
> 32-bit architecture is to increase the page size.
> 
> This series adds support for 8K, 16K, 32K and 64K kernel pages. On
> ARM the hardware page size can be either 4K or 64K, so for the other
> sizes a "software emulation" is used: Linux thinks it is using pages
> of 8 KB, 16 KB or 32 KB, while underneath the MMU still uses 4 KB
> pages.
> 
> For ARM there is already a difference between the kernel page and the
> hardware page in the way they are managed. Within the same 4K page,
> the Linux kernel deals with two PTE tables at the beginning, while the
> hardware deals with two other, hardware PTE tables.

This is incorrect.  The kernel page size and the hardware page size
match today - both are 4k.  What you're talking about here is the
PTE table size.

The kernel requires that each PTE table is contained within one
struct page.  Since one hardware PTE table is 256 entries, it
occupies 1024 bytes, so a quarter of a page.  So, to have a single
4k page per PTE table would waste quite a bit of space.

Now, the hardware PTE tables do not lend themselves to the kernel's
usage: the kernel wants additional bits to track the state of each
page in the page tables.  Hence, we need to shadow every PTE entry.
This also provides us independence of the underlying hardware PTE
entry format, which varies between ARM architecture versions.

So, we end up with a single 4k page containing two consecutive
hardware PTE tables, followed by two Linux PTE tables for the kernel's
benefit.

If you increase the page size, then you need to increase the number
of tables in a page, or suffer a huge amount of wasted memory taken
for the page tables - going to an 8k page size means that the upper
4k of each page will not be used.  Going to 16k means the upper 12k
won't be used.  And so on - as your software page size increases,
the amount of memory wasted for each PTE table will increase
unless you also increase the number of hardware 1st level entries
pointing to each PTE page.  With 64k pages, 60k of each PTE page
will remain unused.

That isn't very efficient use of memory.
Gregory CLEMENT June 12, 2020, 9:15 a.m. UTC | #2
Hello Russell,

> Hi Gregory,
>
> You're on your own with this one; I've no motivation to re-understand
> the ARM page table code now that 32-bit ARM is basically unsupported.

Understood.

>
> I'll point out some of the things you got wrong below though.

However, thanks for the pointers.

>
> On Thu, Jun 11, 2020 at 03:49:08PM +0200, Gregory CLEMENT wrote:
>> Hello,
>> 
>> On ARM-based NAS devices it is possible to have storage volumes larger
>> than 16 TB, especially with the use of LVM. However, on 32-bit
>> architectures, the page cache index is stored on 32 bits, which means
>> that with a page size of 4 KB we can only address volumes of up to
>> 16 TB.
>> 
>> Therefore, one option to use such large volumes and filesystems on a
>> 32-bit architecture is to increase the page size.
>> 
>> This series adds support for 8K, 16K, 32K and 64K kernel pages. On
>> ARM the hardware page size can be either 4K or 64K, so for the other
>> sizes a "software emulation" is used: Linux thinks it is using pages
>> of 8 KB, 16 KB or 32 KB, while underneath the MMU still uses 4 KB
>> pages.
>> 
>> For ARM there is already a difference between the kernel page and the
>> hardware page in the way they are managed. Within the same 4K page,
>> the Linux kernel deals with two PTE tables at the beginning, while the
>> hardware deals with two other, hardware PTE tables.
>
> This is incorrect.  The kernel page size and the hardware page size
> match today - both are 4k.  What you're talking about here is the
> PTE table size.
>
> The kernel requires that each PTE table is contained within one
> struct page.  Since one hardware PTE table is 256 entries, it
> occupies 1024 bytes, so a quarter of a page.  So, to have a single
> 4k page per PTE table would waste quite a bit of space.
>
> Now, the hardware PTE tables do not lend themselves to the kernel's
> usage: the kernel wants additional bits to track the state of each
> page in the page tables.  Hence, we need to shadow every PTE entry.
> This also provides us independence of the underlying hardware PTE
> entry format, which varies between ARM architecture versions.
>
> So, we end up with a single 4k page containing two consecutive
> hardware PTE tables, followed by two Linux PTE tables for the kernel's
> benefit.
>

That is what I understood, but it seems I didn't formulate it
accurately.

> If you increase the page size, then you need to increase the number
> of tables in a page, or suffer a huge amount of wasted memory taken
> for the page tables - going to an 8k page size means that the upper
> 4k of each page will not be used.  Going to 16k means the upper 12k
> won't be used.  And so on - as your software page size increases,
> the amount of memory wasted for each PTE table will increase
> unless you also increase the number of hardware 1st level entries
> pointing to each PTE page.  With 64k pages, 60k of each PTE page
> will remain unused.

Unfortunately, I was aware of that. But I thought it was an acceptable
drawback to be able to address large volumes on a 32-bit ARM
system. It is actually already the case on some products.

> That isn't very efficient use of memory.

Indeed. However, on a 3GB system, in the worst case we need 786432
pages of 4K to map the memory. These pages can be mapped by 1536
blocks of 512 entries. So when 64K pages are emulated we lose 92MB
(around 3% of the memory). It is not negligible, but given the use
case it seems acceptable.

Of course, that doesn't prevent us from trying to do better.

Gregory
>
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC for 0.8m (est. 1762m) line in suburbia: sync at 13.1Mbps down 503kbps up
Arnd Bergmann June 12, 2020, 9:23 a.m. UTC | #3
On Thu, Jun 11, 2020 at 6:21 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:

> If you increase the page size, then you need to increase the number
> of tables in a page, or suffer a huge amount of wasted memory taken
> for the page tables - going to an 8k page size means that the upper
> 4k of each page will not be used.  Going to 16k means the upper 12k
> won't be used.  And so on - as your software page size increases,
> the amount of memory wasted for each PTE table will increase
> unless you also increase the number of hardware 1st level entries
> pointing to each PTE page.  With 64k pages, 60k of each PTE page
> will remain unused.
>
> That isn't very efficient use of memory.

I think this could be addressed by using the full page to contain
PTEs by making PTRS_PER_PTE larger and PTRS_PER_PGD
smaller, but there is an even bigger problem in the added memory
usage and I/O overhead for basically everything else: in any
sparsely populated memory mapped file or anonymous mapping,
the memory usage grows with the page size as well.

I think Synology's vendor kernels for their NAS boxes have a
different hack to make large file systems work, by extending
the internal data types (I forgot which ones) to 64 bit. That is
probably more invasive to the generic kernel code, but should
be much more efficient and less invasive to ARM architecture
specific code.

Either way, I wonder what the intended use cases are. Is this
work mainly intended for

a) running Debian/Buildroot/Yocto/... with (close to) upstream
   kernels on older NAS boxes,
b) commercial products that use 32-bit SoCs in multi-disk
   NAS boxes with vendor upgrades to future kernels, or
c) commercial products using 64-bit SoCs but 32-bit kernels?

My feeling is that any commercial products that need this are
either stuck on old kernels already, or they have moved on
to 64-bit chips and are better off running a 64-bit kernel[1], so
a) seems like the main purpose, right?

       Arnd
Catalin Marinas June 12, 2020, 12:21 p.m. UTC | #4
On Fri, Jun 12, 2020 at 11:23:11AM +0200, Arnd Bergmann wrote:
> On Thu, Jun 11, 2020 at 6:21 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> 
> > If you increase the page size, then you need to increase the number
> > of tables in a page, or suffer a huge amount of wasted memory taken
> > for the page tables - going to an 8k page size means that the upper
> > 4k of each page will not be used.  Going to 16k means the upper 12k
> > won't be used.  And so on - as your software page size increases,
> > the amount of memory wasted for each PTE table will increase
> > unless you also increase the number of hardware 1st level entries
> > pointing to each PTE page.  With 64k pages, 60k of each PTE page
> > will remain unused.
> >
> > That isn't very efficient use of memory.
> 
> I think this could be addressed by using the full page to contain
> PTEs by making PTRS_PER_PTE larger and PTRS_PER_PGD
> smaller, but there is an even bigger problem in the added memory
> usage and I/O overhead for basically everything else: in any
> sparsely populated memory mapped file or anonymous mapping,
> the memory usage grows with the page size as well.
> 
> I think Synology's vendor kernels for their NAS boxes have a
> different hack to make large file systems work, by extending
> the internal data types (I forgot which ones) to 64 bit. That is
> probably more invasive to the generic kernel code, but should
> be much more efficient and less invasive to ARM architecture
> specific code.

IIUC from Gregory's cover letter, the problem is page->index, which is
a pgoff_t, i.e. an unsigned long. This limits us to 32-bit page
offsets, so a 44-bit actual file offset (16TB). It may be worth
exploring this rather than hacking the page tables to pretend we have
bigger page sizes.
Arnd Bergmann June 12, 2020, 12:49 p.m. UTC | #5
On Fri, Jun 12, 2020 at 2:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Jun 12, 2020 at 11:23:11AM +0200, Arnd Bergmann wrote:
> > On Thu, Jun 11, 2020 at 6:21 PM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> >
> > > If you increase the page size, then you need to increase the number
> > > of tables in a page, or suffer a huge amount of wasted memory taken
> > > for the page tables - going to an 8k page size means that the upper
> > > 4k of each page will not be used.  Going to 16k means the upper 12k
> > > won't be used.  And so on - as your software page size increases,
> > > the amount of memory wasted for each PTE table will increase
> > > unless you also increase the number of hardware 1st level entries
> > > pointing to each PTE page.  With 64k pages, 60k of each PTE page
> > > will remain unused.
> > >
> > > That isn't very efficient use of memory.
> >
> > I think this could be addressed by using the full page to contain
> > PTEs by making PTRS_PER_PTE larger and PTRS_PER_PGD
> > smaller, but there is an even bigger problem in the added memory
> > usage and I/O overhead for basically everything else: in any
> > sparsely populated memory mapped file or anonymous mapping,
> > the memory usage grows with the page size as well.
> >
> > I think Synology's vendor kernels for their NAS boxes have a
> > different hack to make large file systems work, by extending
> > the internal data types (I forgot which ones) to 64 bit. That is
> > probably more invasive to the generic kernel code, but should
> > be much more efficient and less invasive to ARM architecture
> > specific code.
>
> IIUC from Gregory's cover letter, the problem is page->index, which is
> a pgoff_t, i.e. an unsigned long. This limits us to 32-bit page
> offsets, so a 44-bit actual file offset (16TB). It may be worth
> exploring this rather than hacking the page tables to pretend we have
> bigger page sizes.

Right, that's at least one type that needs to be changed; there may be
additional ones besides it. In Synology's patch there is also a new
rdx_t type that is defined the same way and used elsewhere, at least
for some SoCs (they use a maze of #ifdefs to merge all the vendor
kernels, and they also strip all code comments and git history from
the tarballs).

https://pastebin.com/e8C1zhzG has an attempt to split out
the relevant changes from the linux-3.10.105 tarball that they
use on Armada 385, see
https://sourceforge.net/projects/dsgpl/files/Synology%20NAS%20GPL%20Source/24922branch/armada38x-source/linux-3.10.x-bsp.txz/download
for the full kernel sources.

      Arnd