
[RFC,v2,00/30] 1GB PUD THP support on x86_64

Message ID 20200928175428.4110504-1-zi.yan@sent.com (mailing list archive)

Message

Zi Yan Sept. 28, 2020, 5:53 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for 1GB PUD THP on x86_64. It is on top of
v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23

Other than PUD THP, we had some discussion on generating THPs and contiguous
physical memory via a synchronous system call [0]. I am planning to send out a
separate patchset on it later, since I feel that it can be done independently of
PUD THP support.

Any comment or suggestion is welcome. Thanks.

Motivation
====
The patchset aims to provide a more transparent way of boosting virtual
memory performance by leveraging gigantic TLB entries, compared to hugetlbfs
pages [1,2]. Roman also said he would provide performance numbers for 1GB
PUD THP once the patchset is in relatively good shape [1].


Patchset organization:
====

1. Patch 1 and 2: Jason's PUD entry READ_ONCE patch for walk_page_range, to give
   a consistent read of PUD entries during lockless page table walks.
   I also add a PMD entry READ_ONCE patch, since the PMD-level walk_page_range
   has the same lockless behavior as the PUD level. (A sketch of the pattern
   follows this list.)

2. Patch 3: THP page table deposit now uses a singly linked list to enable
   hierarchical page table deposit, i.e., depositing a PMD page onto which 512
   PTE pages are deposited. Every page table page has a deposit_head and a
   deposit_node. For example, when storing 512 PTE pages to a PMD page, the PMD
   page's deposit_head links to one PTE page's deposit_node, which links to
   another PTE page's deposit_node. (See the deposit sketch after this list.)

3. Patch 4,5,6: helper functions for allocating page table pages for PUD THPs
   and change thp_order and thp_nr.

4. Patch 7 to 23: PUD THP implementation. It is broken into small patches for
   easy review.

5. Patch 24, 25: new page size encoding for MADV_HUGEPAGE and MADV_NOHUGEPAGE in
   madvise, so users can specify the THP size. Only MADV_HUGEPAGE_1GB is accepted.
   VM_HUGEPAGE_PUD is added to vm_flags at bit 37 to store this information.
   (A usage sketch follows this list.) You are welcome to suggest any other
   approach.

6. Patch 26, 27: enable_pud_thp and hpage_pud_size are added to
   /sys/kernel/mm/transparent_hugepage/. enable_pud_thp is set to never by
   default.

7. Patch 28, 29: PUD THPs are allocated only from boot-time reserved CMA regions.
   The CMA regions can still be used for other movable page allocations.
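
To illustrate item 1, here is a simplified sketch (not the actual patch; the
real callback wiring differs) of the lockless-read pattern applied in
mm/pagewalk.c: take one READ_ONCE() snapshot of the PUD entry and only ever act
on that snapshot, so a concurrent split/collapse cannot be observed halfway:

/* sketch only: simplified from mm/pagewalk.c */
static int walk_pud_range_sketch(p4d_t *p4d, unsigned long addr,
                                 unsigned long end, struct mm_walk *walk)
{
        pud_t *pudp = pud_offset(p4d, addr);
        pud_t pud = READ_ONCE(*pudp);   /* single consistent snapshot */

        if (pud_none(pud))
                return 0;               /* nothing mapped here */

        if (pud_trans_huge(pud) || pud_devmap(pud))
                /* act on the snapshot; never re-read *pudp */
                return walk->ops->pud_entry ?
                       walk->ops->pud_entry(&pud, addr, end, walk) : 0;

        /* otherwise descend to the PMD level, where the same pattern repeats */
        return 0;
}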

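To illustrate item 2, a minimal sketch of the hierarchical deposit using the
deposit_head/deposit_node fields described above (the llist types and helper
names are illustrative assumptions, not necessarily what the patch uses):

#include <linux/llist.h>
#include <linux/mm_types.h>

/* push a PTE page table page onto a PMD page table page's deposit list */
static void deposit_pte_page(struct page *pmd_page, struct page *pte_page)
{
        llist_add(&pte_page->deposit_node, &pmd_page->deposit_head);
}

/* pop one deposited PTE page table page, or NULL if none are left */
static struct page *withdraw_pte_page(struct page *pmd_page)
{
        struct llist_node *node = llist_del_first(&pmd_page->deposit_head);

        return node ? llist_entry(node, struct page, deposit_node) : NULL;
}

The same two fields let a PUD THP deposit 512 such PMD pages in turn, giving
the hierarchical (PUD -> PMD -> PTE) deposit described above.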

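To illustrate item 5, a userspace sketch of how a VMA would opt in to 1GB THPs
with the proposed MADV_HUGEPAGE_1GB flag (the flag only exists with this
patchset's uapi headers, hence the #ifdef guard):

#include <stdio.h>
#include <sys/mman.h>

#define GB (1UL << 30)

int main(void)
{
        size_t len = 2 * GB;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

#ifdef MADV_HUGEPAGE_1GB
        /* ask the kernel to back this range with 1GB PUD THPs (best effort) */
        if (madvise(buf, len, MADV_HUGEPAGE_1GB))
                perror("madvise(MADV_HUGEPAGE_1GB)");
#endif

        /* touching the range lets the page fault path allocate PUD THPs */
        ((volatile char *)buf)[0] = 1;
        return 0;
}
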
Design for PUD-, PMD-, and PTE-mapped PUD THP
====

One design addition compared to PMD THP is the support for PMD-mapped PUD THP,
since the original THP design already handles PUD-mapped and PTE-mapped PUD THPs
automatically.

PMD mapcounts are stored at subpages (512*N + 3) (N = 0 to 511), and the 512*N
subpages are called PMDPageInPUD. A PUDDoubleMap bit is stored at the third
subpage of a PUD THP, using the same page flag position as DoubleMap (which is
stored at the second subpage of a PMD THP), to indicate a PUD THP with both PUD
and PMD mappings.


A PUD THP looks like:

┌───┬───┬───┬───┬─────┬───┬───┬───┬───┬────────┬──────┐
│ H │ T │ T │ T │ ... │ T │ T │ T │ T │  ...   │  T   │
│ 0 │ 1 │ 2 │ 3 │     │512│513│514│515│        │262143│
└───┴───┴───┴───┴─────┴───┴───┴───┴───┴────────┴──────┘

PMDPageInPUD pages in a PUD THP (only the first two PMDPageInPUD pages are shown
below). Note that PMDPageInPUD pages are identified by their position relative to
the head page of the PUD THP and are still tail pages except the first one,
so H_0, T_512, T_1024, ..., T_512x511 are all PMDPageInPUD pages:

 ┌────────────┬──────────────┬────────────┬──────────────┬───────────────────┐
 │PMDPageInPUD│     ...      │PMDPageInPUD│     ...      │  the remaining    │
 │    page    │ 511 subpages │    page    │ 511 subpages │ 510x512 subpages  │
 └────────────┴──────────────┴────────────┴──────────────┴───────────────────┘


Mapcount positions:

* For each subpage, its PTE mapcount is _mapcount, the same as for a PMD THP.
* For a PUD THP, its PUD mapping uses the compound_mapcount at T_1, the same as
  a PMD THP.
* For a PMD-mapped PUD THP, its PMD mappings use the compound_mapcount at T_3,
  T_515, ..., T_512x511+3. This is called sub_compound_mapcount.

PUDDoubleMap and DoubleMap in PUD THP:

* PUDDoubleMap is stored in the page flags of T_2 (the third subpage), reusing
  DoubleMap's flag position.
* DoubleMap is stored in the page flags of T_1 (the second subpage), T_513, ...,
  T_512x511+1.
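
A minimal sketch of the subpage index arithmetic above (illustrative only; the
kernel code works with struct page pointers rather than raw indices):

#define HPAGE_PMD_NR 512        /* 4KB subpages per 2MB PMD region on x86_64 */

/* head subpage of the N-th PMD region, i.e. the PMDPageInPUD page */
static inline unsigned long pmd_page_in_pud_index(unsigned long n)
{
        return n * HPAGE_PMD_NR;
}

/* subpage holding the N-th region's sub_compound_mapcount: T_3, T_515, ... */
static inline unsigned long sub_compound_mapcount_index(unsigned long n)
{
        return n * HPAGE_PMD_NR + 3;
}

The PUD-level compound_mapcount stays at subpage 1 (T_1) and the PUDDoubleMap
flag at subpage 2 (T_2), exactly as listed above.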

[0] https://lore.kernel.org/linux-mm/20200907072014.GD30144@dhcp22.suse.cz/
[1] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[2] https://lore.kernel.org/linux-mm/20200903165051.GN24045@ziepe.ca/


Changelog from RFC v1
====
1. Add Jason's PUD entry READ_ONCE patch and my PMD entry READ_ONCE patch to
   get consistent page table entry reading in lockless page table walks.
2. Use a singly linked list for page table page deposit instead of the pagechain
   data structure from RFC v1.
3. Address Kirill's comments.
4. Remove PUD page allocation via alloc_contig_pages(), using cma_alloc only.
5. Add madvise flag MADV_HUGEPAGE_1GB to explicitly enable PUD THP on specific
   VMAs instead of reusing MADV_HUGEPAGE. A new vm_flags bit, VM_HUGEPAGE_PUD, is
   added to achieve this.
6. Break large patches in v1 into small ones for easy review.

Jason Gunthorpe (1):
  mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked

Zi Yan (29):
  mm: pagewalk: use READ_ONCE when reading the PMD entry unlocked
  mm: thp: use single linked list for THP page table page deposit.
  mm: add new helper functions to allocate one PMD page with 512 PTE
    pages.
  mm: thp: add page table deposit/withdraw functions for PUD THP.
  mm: change thp_order and thp_nr as we will have not just PMD THPs.
  mm: thp: add anonymous PUD THP page fault support without enabling it.
  mm: thp: add PUD THP support for copy_huge_pud.
  mm: thp: add PUD THP support to zap_huge_pud.
  fs: proc: add PUD THP kpageflag.
  mm: thp: handling PUD THP reference bit.
  mm: rmap: add mapped/unmapped page order to anonymous page rmap
    functions.
  mm: rmap: add map_order to page_remove_anon_compound_rmap.
  mm: thp: add PUD THP split_huge_pud_page() function.
  mm: thp: add PUD THP to deferred split list when PUD mapping is gone.
  mm: debug: adapt dump_page to PUD THP.
  mm: thp: PUD THP COW splits PUD page and falls back to PMD page.
  mm: thp: PUD THP follow_p*d_page() support.
  mm: stats: make smap stats understand PUD THPs.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: PUD THP support in try_to_unmap().
  mm: thp: split PUD THPs at page reclaim.
  mm: support PUD THP pagemap support.
  mm: madvise: add page size options to MADV_HUGEPAGE and
    MADV_NOHUGEPAGE.
  mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37.
  mm: thp: add a global knob to enable/disable PUD THPs.
  mm: thp: make PUD THP size public.
  hugetlb: cma: move cma reserve function to cma.c.
  mm: thp: use cma reservation for pud thp allocation.
  mm: thp: enable anonymous PUD THP at page fault path.

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/admin-guide/mm/transhuge.rst    |   1 +
 arch/arm64/mm/hugetlbpage.c                   |   2 +-
 arch/powerpc/mm/hugetlbpage.c                 |   2 +-
 arch/x86/include/asm/pgalloc.h                |  69 ++
 arch/x86/include/asm/pgtable.h                |  26 +
 arch/x86/kernel/setup.c                       |   8 +-
 arch/x86/mm/pgtable.c                         |  38 +
 drivers/base/node.c                           |   3 +
 fs/proc/meminfo.c                             |   2 +
 fs/proc/page.c                                |   2 +
 fs/proc/task_mmu.c                            | 200 +++-
 include/linux/cma.h                           |  18 +
 include/linux/huge_mm.h                       |  84 +-
 include/linux/hugetlb.h                       |  12 -
 include/linux/memcontrol.h                    |   5 +
 include/linux/mm.h                            |  42 +-
 include/linux/mm_types.h                      |  11 +-
 include/linux/mmu_notifier.h                  |  13 +
 include/linux/mmzone.h                        |   1 +
 include/linux/page-flags.h                    |  48 +
 include/linux/pagewalk.h                      |   4 +-
 include/linux/pgtable.h                       |  34 +
 include/linux/rmap.h                          |  10 +-
 include/linux/swap.h                          |   2 +
 include/linux/vm_event_item.h                 |   7 +
 include/uapi/asm-generic/mman-common.h        |  23 +
 include/uapi/linux/kernel-page-flags.h        |   1 +
 kernel/events/uprobes.c                       |   4 +-
 kernel/fork.c                                 |  10 +-
 mm/cma.c                                      | 119 +++
 mm/debug.c                                    |   6 +-
 mm/gup.c                                      |  60 +-
 mm/hmm.c                                      |  16 +-
 mm/huge_memory.c                              | 899 +++++++++++++++++-
 mm/hugetlb.c                                  | 117 +--
 mm/khugepaged.c                               |  16 +-
 mm/ksm.c                                      |   4 +-
 mm/madvise.c                                  |  76 +-
 mm/mapping_dirty_helpers.c                    |   6 +-
 mm/memcontrol.c                               |  43 +-
 mm/memory.c                                   |  28 +-
 mm/mempolicy.c                                |  29 +-
 mm/migrate.c                                  |  12 +-
 mm/mincore.c                                  |  10 +-
 mm/page_alloc.c                               |  53 +-
 mm/page_vma_mapped.c                          | 171 +++-
 mm/pagewalk.c                                 |  47 +-
 mm/pgtable-generic.c                          |  49 +-
 mm/ptdump.c                                   |   3 +-
 mm/rmap.c                                     | 300 ++++--
 mm/swap.c                                     |  30 +
 mm/swap_slots.c                               |   2 +
 mm/swapfile.c                                 |  11 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |  22 +-
 mm/vmscan.c                                   |  33 +-
 mm/vmstat.c                                   |   8 +
 58 files changed, 2396 insertions(+), 460 deletions(-)

--
2.28.0

Comments

Michal Hocko Sept. 30, 2020, 11:55 a.m. UTC | #1
On Mon 28-09-20 13:53:58, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
> 
> Other than PUD THP, we had some discussion on generating THPs and contiguous
> physical memory via a synchronous system call [0]. I am planning to send out a
> separate patchset on it later, since I feel that it can be done independently of
> PUD THP support.

While the technical challenges for the kernel implementation can be
discussed before the user API is decided, I believe we cannot simply add
something now and then decide about a proper interface. I have raised a
few basic questions we should find answers for before any interface is
added. Let me copy them here for easier reference:
- THP allocation time - #PF and/or madvise context
- lazy/sync instantiation
- huge page sizes controllable by the userspace?
- aggressiveness - how hard to try
- internal fragmentation - allow to create THPs on sparsely or unpopulated
  ranges
- do we need some sort of access control or privilege check as some THPs
  would be a really scarce resource (like those that require pre-reservation).
Zi Yan Oct. 1, 2020, 3:14 p.m. UTC | #2
On 30 Sep 2020, at 7:55, Michal Hocko wrote:

> On Mon 28-09-20 13:53:58, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
>> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
>>
>> Other than PUD THP, we had some discussion on generating THPs and contiguous
>> physical memory via a synchronous system call [0]. I am planning to send out a
>> separate patchset on it later, since I feel that it can be done independently of
>> PUD THP support.
>
> While the technical challenges for the kernel implementation can be
> discussed before the user API is decided I believe we cannot simply add
> something now and then decide about a proper interface. I have raised
> few basic questions we should should find answers for before the any
> interface is added. Let me copy them here for easier reference
Sure. Thank you for doing this.

For this new interface, I think it should generate THPs out of populated
memory regions synchronously. It would be a complement to khugepaged, which
generates THPs asynchronously in the background.

> - THP allocation time - #PF and/or madvise context
I am not sure this is relevant, since the new interface is supposed to
operate on populated memory regions. For THP allocation, madvise and
the options from /sys/kernel/mm/transparent_hugepage/defrag should give
enough choices to users.

> - lazy/sync instantiation

I would say the new interface only does sync instantiation. madvise has
provided the lazy instantiation option by adding MADV_HUGEPAGE to populated
memory regions and letting khugepaged generate THPs from them.

> - huge page sizes controllable by the userspace?

It might be good to allow advanced users to choose the page sizes, so they
have better control of their applications. For normal users, we can provide
a best-effort service. Different options can be provided for these two cases.
The new interface might want to inform the user how many THPs were generated
after the call, so they can decide what to do with the memory region.

> - aggressiveness - how hard to try

The new interface would try as hard as it can, since I assume users really
want THPs when they use this interface.

> - internal fragmentation - allow to create THPs on sparsely or unpopulated
>   ranges

The new interface would only operate on populated memory regions. A MAP_POPULATE-
like option can be added if necessary.


> - do we need some sort of access control or privilege check as some THPs
>   would be a really scarce (like those that require pre-reservation).

It seems too much to me. I suppose if we provide page size options to users
when generating THPs, user apps could coordinate among themselves. BTW, do we
have access control for hugetlb pages? If yes, we could borrow their method.


—
Best Regards,
Yan Zi
Michal Hocko Oct. 2, 2020, 7:32 a.m. UTC | #3
On Thu 01-10-20 11:14:14, Zi Yan wrote:
> On 30 Sep 2020, at 7:55, Michal Hocko wrote:
> 
> > On Mon 28-09-20 13:53:58, Zi Yan wrote:
> >> From: Zi Yan <ziy@nvidia.com>
> >>
> >> Hi all,
> >>
> >> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
> >> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
> >> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
> >>
> >> Other than PUD THP, we had some discussion on generating THPs and contiguous
> >> physical memory via a synchronous system call [0]. I am planning to send out a
> >> separate patchset on it later, since I feel that it can be done independently of
> >> PUD THP support.
> >
> > While the technical challenges for the kernel implementation can be
> > discussed before the user API is decided I believe we cannot simply add
> > something now and then decide about a proper interface. I have raised
> > few basic questions we should should find answers for before the any
> > interface is added. Let me copy them here for easier reference
> Sure. Thank you for doing this.
> 
> For this new interface, I think it should generate THPs out of populated
> memory regions synchronously. It would be complement to khugepaged, which
> generate THPs asynchronously on the background.
> 
> > - THP allocation time - #PF and/or madvise context
> I am not sure this is relevant, since the new interface is supposed to
> operate on populated memory regions. For THP allocation, madvise and
> the options from /sys/kernel/mm/transparent_hugepage/defrag should give
> enough choices to users.

OK, so no #PF, this makes things easier.

> > - lazy/sync instantiation
> 
> I would say the new interface only does sync instantiation. madvise has
> provided the lazy instantiation option by adding MADV_HUGEPAGE to populated
> memory regions and letting khugepaged generate THPs from them.

OK

> > - huge page sizes controllable by the userspace?
> 
> It might be good to allow advanced users to choose the page sizes, so they
> have better control of their applications.

Could you elaborate more? Those advanced users can use hugetlb, right?
They get a very good control over page size and pool preallocation etc.
So they can get what they need - assuming there is enough memory.

> For normal users, we can provide
> best-effort service. Different options can be provided for these two cases.

Do we really need two sync mechanisms to compact physical memory? This
adds API complexity because it has to cover all possible huge pages,
and that can be a large set of sizes. We already have that choice for
the hugetlb mmap interface, but that is needed to cover all existing setups.
I would argue this doesn't make the API particularly easy to use.

> The new interface might want to inform user how many THPs are generated
> after the call for them to decide what to do with the memory region.

Why would that be useful? /proc/<pid>/smaps should give a good picture
already, right?

> > - aggressiveness - how hard to try
> 
> The new interface would try as hard as it can, since I assume users really
> want THPs when they use this interface.
> 
> > - internal fragmentation - allow to create THPs on sparsely or unpopulated
> >   ranges
> 
> The new interface would only operate on populated memory regions. MAP_POPULATE
> like option can be added if necessary.

OK, so initially you do not want to populate more memory. How do you
envision a future extension to provide such functionality? A different
API, or a modification to the existing one?

> > - do we need some sort of access control or privilege check as some THPs
> >   would be a really scarce (like those that require pre-reservation).
> 
> It seems too much to me. I suppose if we provide page size options to users
> when generating THPs, users apps could coordinate themselves. BTW, do we have
> access control for hugetlb pages? If yes, we could borrow their method.

We do not. Well, there is a hugetlb cgroup controller but I am not sure
this is the right method. The lack of hugetlb access control is a serious
shortcoming which has turned this interface into an "only first-class
citizens" feature requiring very close coordination with an admin.
David Hildenbrand Oct. 2, 2020, 7:50 a.m. UTC | #4
>>> - huge page sizes controllable by the userspace?
>>
>> It might be good to allow advanced users to choose the page sizes, so they
>> have better control of their applications.
> 
> Could you elaborate more? Those advanced users can use hugetlb, right?
> They get a very good control over page size and pool preallocation etc.
> So they can get what they need - assuming there is enough memory.
> 

I am still not convinced that 1G THP (TGP :) ) are really what we want
to support. I can understand that there are some use cases that might
benefit from it, especially:

"I want a lot of memory, give me memory in any granularity you have, I
absolutely don't care - but of course, more TGP might be good for
performance." Say, you want a 5GB region, but only have a single 1GB
hugepage lying around. hugetlbfs allocation will fail.


But then, do we really want to optimize for such (very special?) use
cases via " 58 files changed, 2396 insertions(+), 460 deletions(-)" ?

I think gigantic pages are a sparse resource. Only selected applications
*really* depend on them and benefit from them. Let these special
applications handle it explicitly.

Can we have a summary of use cases that would really benefit from this
change?
Michal Hocko Oct. 2, 2020, 8:10 a.m. UTC | #5
On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>> - huge page sizes controllable by the userspace?
> >>
> >> It might be good to allow advanced users to choose the page sizes, so they
> >> have better control of their applications.
> > 
> > Could you elaborate more? Those advanced users can use hugetlb, right?
> > They get a very good control over page size and pool preallocation etc.
> > So they can get what they need - assuming there is enough memory.
> > 
> 
> I am still not convinced that 1G THP (TGP :) ) are really what we want
> to support. I can understand that there are some use cases that might
> benefit from it, especially:

Well, I would say that internal support for larger huge pages (e.g. 1GB)
that can transparently split under memory pressure is a useful
functionality. I cannot really judge how complex that would be,
considering that 2MB THP have turned out to be quite a pain, but the
situation has settled over time. Maybe our current code base is prepared
for that much better.

Exposing that interface to the userspace is a different story of course.
I do agree that we likely do not want to be very explicit about that.
E.g. an interface for address space defragmentation without any more
specifics sounds like a useful feature to me. It will be up to the
kernel to decide which huge pages to use.
David Hildenbrand Oct. 2, 2020, 8:30 a.m. UTC | #6
On 02.10.20 10:10, Michal Hocko wrote:
> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>> - huge page sizes controllable by the userspace?
>>>>
>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>> have better control of their applications.
>>>
>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>> They get a very good control over page size and pool preallocation etc.
>>> So they can get what they need - assuming there is enough memory.
>>>
>>
>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>> to support. I can understand that there are some use cases that might
>> benefit from it, especially:
> 
> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> that can transparently split under memory pressure is a useful
> funtionality. I cannot really judge how complex that would be

Right, but that's then something different than serving (scarce,
unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
wrong about *real* THP support, meaning, e.g., grouping consecutive
pages and converting them back and forth on demand. (E.g., 1GB ->
multiple 2MB -> multiple single pages), for example, when having to
migrate such a gigantic page. But that's very different from our
existing gigantic page code as far as I can tell.

> consideting that 2MB THP have turned out to be quite a pain but
> situation has settled over time. Maybe our current code base is prepared
> for that much better.
> 
> Exposing that interface to the userspace is a different story of course.
> I do agree that we likely do not want to be very explicit about that.
> E.g. an interface for address space defragmentation without any more
> specifics sounds like a useful feature to me. It will be up to the
> kernel to decide which huge pages to use.

Yes, I think one important feature would be that we don't end up placing
a gigantic page where only a handful of pages are actually populated
without green light from the application - because that's what some user
space applications care about (not consuming more memory than intended.
IIUC, this is also what this patch set does). I'm fine with placing
gigantic pages if it really just "defragments" the address space layout,
without filling unpopulated holes.

Then, this would be mostly invisible to user space, and we really
wouldn't have to care about any configuration.
Zi Yan Oct. 5, 2020, 3:03 p.m. UTC | #7
On 2 Oct 2020, at 4:30, David Hildenbrand wrote:

> On 02.10.20 10:10, Michal Hocko wrote:
>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>> - huge page sizes controllable by the userspace?
>>>>>
>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>> have better control of their applications.
>>>>
>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>> They get a very good control over page size and pool preallocation etc.
>>>> So they can get what they need - assuming there is enough memory.
>>>>
>>>
>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>> to support. I can understand that there are some use cases that might
>>> benefit from it, especially:
>>
>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>> that can transparently split under memory pressure is a useful
>> funtionality. I cannot really judge how complex that would be
>
> Right, but that's then something different than serving (scarce,
> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> wrong about *real* THP support, meaning, e.g., grouping consecutive
> pages and converting them back and forth on demand. (E.g., 1GB ->
> multiple 2MB -> multiple single pages), for example, when having to
> migrate such a gigantic page. But that's very different from our
> existing gigantic page code as far as I can tell.

Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
bump MAX_ORDER to 20 to enable 1GB page allocation in the buddy allocator,
which would need a section size increase. In addition, unmovable pages
cannot be allocated in CMA, so allocating 1GB pages has a much higher chance
of succeeding there than in ZONE_NORMAL.


>> consideting that 2MB THP have turned out to be quite a pain but
>> situation has settled over time. Maybe our current code base is prepared
>> for that much better.

I am planning to refactor my code further to reduce the amount of
added code, since PUD THP is very similar to PMD THP. One thing
I want to achieve is to enable split_huge_page to split any order of
pages into a group of any lower order of pages. A lot of code in this
patchset replicates the behavior of PMD THP at the PUD level, so it
might be possible to deduplicate most of it.
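
As a rough, standalone illustration of what splitting to an arbitrary lower
order means in terms of counts and offsets (toy C, not kernel code; the real
work is fixing up struct page metadata and mappings for each piece):

#include <stdio.h>

int main(void)
{
        unsigned int from = 18;         /* 1GB PUD THP is order 18 on x86_64 */
        unsigned int to = 9;            /* 2MB PMD THP is order 9 */
        unsigned long pieces = 1UL << (from - to);

        for (unsigned long i = 0; i < pieces; i++)
                printf("piece %lu starts at subpage %lu\n", i, i << to);
        return 0;
}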

>>
>> Exposing that interface to the userspace is a different story of course.
>> I do agree that we likely do not want to be very explicit about that.
>> E.g. an interface for address space defragmentation without any more
>> specifics sounds like a useful feature to me. It will be up to the
>> kernel to decide which huge pages to use.
>
> Yes, I think one important feature would be that we don't end up placing
> a gigantic page where only a handful of pages are actually populated
> without green light from the application - because that's what some user
> space applications care about (not consuming more memory than intended.
> IIUC, this is also what this patch set does). I'm fine with placing
> gigantic pages if it really just "defragments" the address space layout,
> without filling unpopulated holes.
>
> Then, this would be mostly invisible to user space, and we really
> wouldn't have to care about any configuration.


I agree that the interface should be as simple as no configuration to
most users. But I also wonder why we have hugetlbfs to allow users to
specify different kinds of page sizes, which seems against the discussion
above. Are we assuming advanced users should always use hugetlbfs instead
of THPs?


—
Best Regards,
Yan Zi
Zi Yan Oct. 5, 2020, 3:34 p.m. UTC | #8
On 2 Oct 2020, at 3:50, David Hildenbrand wrote:

>>>> - huge page sizes controllable by the userspace?
>>>
>>> It might be good to allow advanced users to choose the page sizes, so they
>>> have better control of their applications.
>>
>> Could you elaborate more? Those advanced users can use hugetlb, right?
>> They get a very good control over page size and pool preallocation etc.
>> So they can get what they need - assuming there is enough memory.
>>
>
> I am still not convinced that 1G THP (TGP :) ) are really what we want
> to support. I can understand that there are some use cases that might
> benefit from it, especially:
>
> "I want a lot of memory, give me memory in any granularity you have, I
> absolutely don't care - but of course, more TGP might be good for
> performance." Say, you want a 5GB region, but only have a single 1GB
> hugepage lying around. hugetlbfs allocation will fail.
>
>
> But then, do we really want to optimize for such (very special?) use
> cases via " 58 files changed, 2396 insertions(+), 460 deletions(-)" ?

I am planning to further refactor my code to reduce the size and make
it more general to support any size of THPs. As Matthew’s patchset[1]
is removing kernel’s THP size assumption, it might be a good time to
make THP support more general.

>
> I think gigantic pages are a sparse resource. Only selected applications
> *really* depend on them and benefit from them. Let these special
> applications handle it explicitly.
>
> Can we have a summary of use cases that would really benefit from this
> change?

For large machine learning applications, 1GB pages give a good performance boost [2].
An NVIDIA DGX A100 box now has 1TB of memory, which means 1GB pages are not
that sparse in GPU-equipped infrastructure [3].

In addition, @Roman Gushchin should be able to provide a more concrete
story from his side.


[1] https://lore.kernel.org/linux-mm/20200908195539.25896-1-willy@infradead.org/
[2] http://learningsys.org/neurips19/assets/papers/18_CameraReadySubmission_MLSys_NeurIPS_2019.pdf
[3] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-dgx-a100-datasheet.pdf

—
Best Regards,
Yan Zi
Matthew Wilcox (Oracle) Oct. 5, 2020, 3:55 p.m. UTC | #9
On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> > Yes, I think one important feature would be that we don't end up placing
> > a gigantic page where only a handful of pages are actually populated
> > without green light from the application - because that's what some user
> > space applications care about (not consuming more memory than intended.
> > IIUC, this is also what this patch set does). I'm fine with placing
> > gigantic pages if it really just "defragments" the address space layout,
> > without filling unpopulated holes.
> >
> > Then, this would be mostly invisible to user space, and we really
> > wouldn't have to care about any configuration.
> 
> I agree that the interface should be as simple as no configuration to
> most users. But I also wonder why we have hugetlbfs to allow users to
> specify different kinds of page sizes, which seems against the discussion
> above. Are we assuming advanced users should always use hugetlbfs instead
> of THPs?

Evolution doesn't always produce the best outcomes ;-)

A perennial mistake we've made is "Oh, this is a strange & new & weird
feature that most applications will never care about, let's put it in
hugetlbfs where nobody will notice and we don't have to think about it
in the core VM"

And then what was initially strange & new & weird gradually becomes
something that most applications just want to have happen automatically,
and telling them all to go use hugetlbfs becomes untenable, so we move
the feature into the core VM.

It is absurd that my phone is attempting to manage a million 4kB pages.
I think even trying to manage a quarter-million 16kB pages is too much
work, and really it would be happier managing 65,000 64kB pages.

Extend that into the future a decade or two, and we'll be expecting
that it manages memory in megabyte sized units and uses PMD and PUD
mappings by default.  PTE mappings will still be used, but very much
on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into
smaller pages to not waste too much memory when mapping it" basis.  So,
yeah, PUD sized mappings have problems today, but we should be writing
software now so a Pixel 15 in a decade can boot a kernel built five
years from now and have PUD mappings Just Work without requiring the
future userspace programmer to "use hugetlbfs".

One of the longer-term todo items is to support variable sized THPs for
anonymous memory, just like I've done for the pagecache.  With that in
place, I think scaling up from PMD sized pages to PUD sized pages starts
to look more natural.  Itanium and PA-RISC (two architectures that will
never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
The RiscV spec you pointed me at the other day confines itself to adding
support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
sizes would be possible additions in the future.


But, back to today, what to do with this patchset?  Even on my 16GB
laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB
page is ever the right decision to make.  But my laptop runs a "mixed"
workload, and if you could convince me that Firefox would run 10% faster
by using a 1GB page as its in-memory cache, well, I'd be sold.

I do like having the kernel figure out what's in the best interests of the
system as a whole.  Apps don't have enough information, and while they
can provide hints, they're often wrong.  So, let's say an app maps 8GB
of anonymous memory.  As the app accesses it, we should probably start
by allocating 4kB pages to back that memory.  As time goes on and that
memory continues to be accessed and more memory is accessed, it makes
sense to keep track of that, replacing the existing 4kB pages with, say,
16-64kB pages and allocating newly accessed memory with larger pages.
Eventually that should grow to 2MB allocations and PMD mappings.
And then continue on, all the way to 1GB pages.

We also need to be able to figure out that it's not being effective
any more.  One of the issues with tracing accessed/dirty at the 1GB level
is that writing an entire 1GB page is going to take 0.25 seconds on a x4
gen3 PCIe link.  I know swapping sucks, but that's extreme.  So to use
1GB pages effectively today, we need to fragment them before choosing to
swap them out (*)  Maybe that's the point where we can start to say "OK,
this sized mapping might not be effective any more".  On the other hand,
that might not work for some situations.  Imagine, eg, a matrix multiply
(everybody's favourite worst-case scenario).  C = A * B where each of A,
B and C is too large to fit in DRAM.  There are going to be points of the
calculation where each element of A is going to be walked sequentially,
and so it'd be nice to use larger PTEs to map it, but then we need to
destroy that almost immediately to allow other things to use the memory.


I think I'm leaning towards not merging this patchset yet.  I'm in
agreement with the goals (allowing systems to use PUD-sized pages
automatically), but I think we need to improve the infrastructure to
make it work well automatically.  Does that make sense?

(*) It would be nice if hardware provided a way to track D/A on a sub-PTE
level when using PMD/PUD sized mappings.  I don't know of any that does
that today.
Roman Gushchin Oct. 5, 2020, 5:04 p.m. UTC | #10
On Mon, Oct 05, 2020 at 04:55:53PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> > On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> > > Yes, I think one important feature would be that we don't end up placing
> > > a gigantic page where only a handful of pages are actually populated
> > > without green light from the application - because that's what some user
> > > space applications care about (not consuming more memory than intended.
> > > IIUC, this is also what this patch set does). I'm fine with placing
> > > gigantic pages if it really just "defragments" the address space layout,
> > > without filling unpopulated holes.
> > >
> > > Then, this would be mostly invisible to user space, and we really
> > > wouldn't have to care about any configuration.
> > 
> > I agree that the interface should be as simple as no configuration to
> > most users. But I also wonder why we have hugetlbfs to allow users to
> > specify different kinds of page sizes, which seems against the discussion
> > above. Are we assuming advanced users should always use hugetlbfs instead
> > of THPs?
> 
> Evolution doesn't always produce the best outcomes ;-)
> 
> A perennial mistake we've made is "Oh, this is a strange & new & weird
> feature that most applications will never care about, let's put it in
> hugetlbfs where nobody will notice and we don't have to think about it
> in the core VM"
> 
> And then what was initially strange & new & weird gradually becomes
> something that most applications just want to have happen automatically,
> and telling them all to go use hugetlbfs becomes untenable, so we move
> the feature into the core VM.
> 
> It is absurd that my phone is attempting to manage a million 4kB pages.
> I think even trying to manage a quarter-million 16kB pages is too much
> work, and really it would be happier managing 65,000 64kB pages.
> 
> Extend that into the future a decade or two, and we'll be expecting
> that it manages memory in megabyte sized units and uses PMD and PUD
> mappings by default.  PTE mappings will still be used, but very much
> on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into
> smaller pages to not waste too much memory when mapping it" basis.  So,
> yeah, PUD sized mappings have problems today, but we should be writing
> software now so a Pixel 15 in a decade can boot a kernel built five
> years from now and have PUD mappings Just Work without requiring the
> future userspace programmer to "use hugetlbfs".
> 
> One of the longer-term todo items is to support variable sized THPs for
> anonymous memory, just like I've done for the pagecache.  With that in
> place, I think scaling up from PMD sized pages to PUD sized pages starts
> to look more natural.  Itanium and PA-RISC (two architectures that will
> never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> The RiscV spec you pointed me at the other day confines itself to adding
> support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> sizes would be possible additions in the future.

+1

> But, back to today, what to do with this patchset?  Even on my 16GB
> laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB
> page is ever the right decision to make.  But my laptop runs a "mixed"
> workload, and if you could convince me that Firefox would run 10% faster
> by using a 1GB page as its in-memory cache, well, I'd be sold.
> 
> I do like having the kernel figure out what's in the best interests of the
> system as a whole.  Apps don't have enough information, and while they
> can provide hints, they're often wrong.

It's definitely true for many cases, but not true for some other cases.

For example, we're running hhvm ( https://hhvm.com/ ) on a large number
of machines. Hhvm is known to have a significant performance benefit
when using hugepages. Exact numbers depend on the exact workload and
configuration, but there is a noticeable difference (in single digits of
percents) between using 4k pages only, 4k pages and 2MB pages, and
4k, 2MB and some 1GB pages.

As of now, we have to use hugetlbfs, mostly because of the lack of 1GB THP support.
It has some significant downsides: e.g. hugetlb memory is not properly accounted for
at the memory cgroup level, it requires additional "management", etc.
If we could allocate 1GB THPs with something like a new madvise, have all
memcg stats working, and have them destroyed transparently on application
exit, that would already be valuable.

> So, let's say an app maps 8GB
> of anonymous memory.  As the app accesses it, we should probably start
> by allocating 4kB pages to back that memory.  As time goes on and that
> memory continues to be accessed and more memory is accessed, it makes
> sense to keep track of that, replacing the existing 4kB pages with, say,
> 16-64kB pages and allocating newly accessed memory with larger pages.
> Eventually that should grow to 2MB allocations and PMD mappings.
> And then continue on, all the way to 1GB pages.
> 
> We also need to be able to figure out that it's not being effective
> any more.  One of the issues with tracing accessed/dirty at the 1GB level
> is that writing an entire 1GB page is going to take 0.25 seconds on a x4
> gen3 PCIe link.  I know swapping sucks, but that's extreme.  So to use
> 1GB pages effectively today, we need to fragment them before choosing to
> swap them out (*)  Maybe that's the point where we can start to say "OK,
> this sized mapping might not be effective any more".  On the other hand,
> that might not work for some situations.  Imagine, eg, a matrix multiply
> (everybody's favourite worst-case scenario).  C = A * B where each of A,
> B and C is too large to fit in DRAM.  There are going to be points of the
> calculation where each element of A is going to be walked sequentially,
> and so it'd be nice to use larger PTEs to map it, but then we need to
> destroy that almost immediately to allow other things to use the memory.
> 
> 
> I think I'm leaning towards not merging this patchset yet.

Please, correct me if I'm wrong, but in my understanding the effort
required for proper 1GB THP support can be roughly split into two parts:
1) technical support of PUD-sized THPs,
2) heuristics to create and destroy them automatically.

The second part will likely require a lot of experimenting and fine-tuning,
and obviously depends on part 1 working. So I don't see why we should
postpone part 1, as long as it doesn't add too much overhead (which is not
the case, right?). If the problem is the introduction of semi-dead code,
we can put it under a config option (I would prefer not to do that, though).

> I'm in
> agreement with the goals (allowing systems to use PUD-sized pages
> automatically), but I think we need to improve the infrastructure to
> make it work well automatically.  Does that make sense?

Is there a plan for this? How can we make sure we're making forward
progress here?

Thank you!
Roman Gushchin Oct. 5, 2020, 5:16 p.m. UTC | #11
On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> 
> > On 02.10.20 10:10, Michal Hocko wrote:
> >> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>> - huge page sizes controllable by the userspace?
> >>>>>
> >>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>> have better control of their applications.
> >>>>
> >>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>> They get a very good control over page size and pool preallocation etc.
> >>>> So they can get what they need - assuming there is enough memory.
> >>>>
> >>>
> >>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>> to support. I can understand that there are some use cases that might
> >>> benefit from it, especially:
> >>
> >> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >> that can transparently split under memory pressure is a useful
> >> funtionality. I cannot really judge how complex that would be
> >
> > Right, but that's then something different than serving (scarce,
> > unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> > wrong about *real* THP support, meaning, e.g., grouping consecutive
> > pages and converting them back and forth on demand. (E.g., 1GB ->
> > multiple 2MB -> multiple single pages), for example, when having to
> > migrate such a gigantic page. But that's very different from our
> > existing gigantic page code as far as I can tell.
> 
> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> which needs section size increase. In addition, unmoveable pages cannot
> be allocated in CMA, so allocating 1GB pages has much higher chance from
> it than from ZONE_NORMAL.

s/higher chances/non-zero chances

Currently we have nothing that prevents the fragmentation of memory
with unmovable pages on the 1GB scale. It means that in the common case
it's highly unlikely to find a contiguous GB without any unmovable page.
As of now, CMA seems to be the only working option.

However it seems there are other use cases for the allocation of contiguous
1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
1GB pages can reduce the fragmentation of the direct mapping.

So I wonder if we need a new mechanism to avoid fragmentation on the 1GB/PUD scale,
e.g. something like a second level of pageblocks. That would allow grouping
all unmovable memory in a few 1GB blocks and leave more 1GB regions available for
gigantic THPs and other use cases. I'm looking now into how it can be done.
If anybody has any ideas here, I'll appreciate them a lot.

Thanks!
David Hildenbrand Oct. 5, 2020, 5:27 p.m. UTC | #12
On 05.10.20 19:16, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>
>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>
>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>> have better control of their applications.
>>>>>>
>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>
>>>>>
>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>> to support. I can understand that there are some use cases that might
>>>>> benefit from it, especially:
>>>>
>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>> that can transparently split under memory pressure is a useful
>>>> funtionality. I cannot really judge how complex that would be
>>>
>>> Right, but that's then something different than serving (scarce,
>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>> multiple 2MB -> multiple single pages), for example, when having to
>>> migrate such a gigantic page. But that's very different from our
>>> existing gigantic page code as far as I can tell.
>>
>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>> which needs section size increase. In addition, unmoveable pages cannot
>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>> it than from ZONE_NORMAL.
> 
> s/higher chances/non-zero chances

Well, the longer the system runs (and consumes a significant amount of
available main memory), the less likely it is.

> 
> Currently we have nothing that prevents the fragmentation of the memory
> with unmovable pages on the 1GB scale. It means that in a common case
> it's highly unlikely to find a continuous GB without any unmovable page.
> As now CMA seems to be the only working option.
> 

And I completely dislike the use of CMA in this context (for example,
allocating via CMA and freeing via the buddy by patching CMA when
splitting up PUDs ...).

> However it seems there are other use cases for the allocation of continuous
> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> 1GB pages can reduce the fragmentation of the direct mapping.

Yes, see RFC v1 where I already cced Mike.

> 
> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
> E.g. something like a second level of pageblocks. That would allow to group
> all unmovable memory in few 1GB blocks and have more 1GB regions available for
> gigantic THPs and other use cases. I'm looking now into how it can be done.

Anything bigger than sections is somewhat problematic: you have to track
that data somewhere. It cannot be the section (in contrast to pageblocks)

> If anybody has any ideas here, I'll appreciate a lot.

I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
somewhat mimics what CMA does (when sized reasonably), works well with
memory hot(un)plug, and is immune to misconfiguration. Within such a
zone, we can try to optimize the placement of larger blocks.
David Hildenbrand Oct. 5, 2020, 5:30 p.m. UTC | #13
>> I think gigantic pages are a sparse resource. Only selected applications
>> *really* depend on them and benefit from them. Let these special
>> applications handle it explicitly.
>>
>> Can we have a summary of use cases that would really benefit from this
>> change?
> 
> For large machine learning applications, 1GB pages give good performance boost[2].
> NVIDIA DGX A100 box now has 1TB memory, which means 1GB pages are not
> that sparse in GPU-equipped infrastructure[3].

Well, they *are* sparse and there are absolutely no guarantees until you
reserve them via CMA, which is just plain ugly IMHO.

In the same setup, you can most probably use hugetlbfs and achieve a
similar result. Not saying it is very user-friendly.
David Hildenbrand Oct. 5, 2020, 5:39 p.m. UTC | #14
>>> consideting that 2MB THP have turned out to be quite a pain but
>>> situation has settled over time. Maybe our current code base is prepared
>>> for that much better.
> 
> I am planning to refactor my code further to reduce the amount of
> the added code, since PUD THP is very similar to PMD THP. One thing
> I want to achieve is to enable split_huge_page to split any order of
> pages to a group of any lower order of pages. A lot of code in this
> patchset is replicating the same behavior of PMD THP at PUD level.
> It might be possible to deduplicate most of the code.
> 
>>>
>>> Exposing that interface to the userspace is a different story of course.
>>> I do agree that we likely do not want to be very explicit about that.
>>> E.g. an interface for address space defragmentation without any more
>>> specifics sounds like a useful feature to me. It will be up to the
>>> kernel to decide which huge pages to use.
>>
>> Yes, I think one important feature would be that we don't end up placing
>> a gigantic page where only a handful of pages are actually populated
>> without green light from the application - because that's what some user
>> space applications care about (not consuming more memory than intended.
>> IIUC, this is also what this patch set does). I'm fine with placing
>> gigantic pages if it really just "defragments" the address space layout,
>> without filling unpopulated holes.
>>
>> Then, this would be mostly invisible to user space, and we really
>> wouldn't have to care about any configuration.
> 
> 
> I agree that the interface should be as simple as no configuration to
> most users. But I also wonder why we have hugetlbfs to allow users to
> specify different kinds of page sizes, which seems against the discussion
> above. Are we assuming advanced users should always use hugetlbfs instead
> of THPs?

Well, with hugetlbfs you get real control over which page sizes to use.
No mixture, guarantees.

In some environments you might want to control which application gets
which pagesize. I know of database applications and hypervisors that
sometimes really want 2MB huge pages instead of 1GB huge pages. And
sometimes you really want/need 1GB huge pages (e.g., low-latency
applications, real-time KVM, ...).

Simple example: KVM with postcopy live migration

While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
on demand (via userfaultfd) is painfully slow / impractical.
Zi Yan Oct. 5, 2020, 6:05 p.m. UTC | #15
On 5 Oct 2020, at 13:39, David Hildenbrand wrote:

>>>> consideting that 2MB THP have turned out to be quite a pain but
>>>> situation has settled over time. Maybe our current code base is prepared
>>>> for that much better.
>>
>> I am planning to refactor my code further to reduce the amount of
>> the added code, since PUD THP is very similar to PMD THP. One thing
>> I want to achieve is to enable split_huge_page to split any order of
>> pages to a group of any lower order of pages. A lot of code in this
>> patchset is replicating the same behavior of PMD THP at PUD level.
>> It might be possible to deduplicate most of the code.
>>
>>>>
>>>> Exposing that interface to the userspace is a different story of course.
>>>> I do agree that we likely do not want to be very explicit about that.
>>>> E.g. an interface for address space defragmentation without any more
>>>> specifics sounds like a useful feature to me. It will be up to the
>>>> kernel to decide which huge pages to use.
>>>
>>> Yes, I think one important feature would be that we don't end up placing
>>> a gigantic page where only a handful of pages are actually populated
>>> without green light from the application - because that's what some user
>>> space applications care about (not consuming more memory than intended.
>>> IIUC, this is also what this patch set does). I'm fine with placing
>>> gigantic pages if it really just "defragments" the address space layout,
>>> without filling unpopulated holes.
>>>
>>> Then, this would be mostly invisible to user space, and we really
>>> wouldn't have to care about any configuration.
>>
>>
>> I agree that the interface should be as simple as no configuration to
>> most users. But I also wonder why we have hugetlbfs to allow users to
>> specify different kinds of page sizes, which seems against the discussion
>> above. Are we assuming advanced users should always use hugetlbfs instead
>> of THPs?
>
> Well, with hugetlbfs you get a real control over which pagesizes to use.
> No mixture, guarantees.
>
> In some environments you might want to control which application gets
> which pagesize. I know of database applications and hypervisors that
> sometimes really want 2MB huge pages instead of 1GB huge pages. And
> sometimes you really want/need 1GB huge pages (e.g., low-latency
> applications, real-time KVM, ...).
>
> Simple example: KVM with postcopy live migration
>
> While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
> on demand (via userfaultdfd) is a painfully slow / impractical.


The real control of hugetlbfs comes from the interfaces provided by
the kernel. If the kernel provides similar interfaces to control the page
sizes of THPs, it should work the same as hugetlbfs. Mixing page sizes
usually comes from system memory fragmentation, and hugetlbfs does not
have this mixture because of its special allocation pools, not because of
the code itself. If THPs are allocated from the same pools, they would act
the same as hugetlbfs. What am I missing here?

I just do not get why hugetlbfs is so special that it can have fine-grained
page size control when normal pages cannot. The "it should be invisible
to userspace" argument suddenly does not hold for hugetlbfs.


—
Best Regards,
Yan Zi
Roman Gushchin Oct. 5, 2020, 6:25 p.m. UTC | #16
On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
> On 05.10.20 19:16, Roman Gushchin wrote:
> > On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> >> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> >>
> >>> On 02.10.20 10:10, Michal Hocko wrote:
> >>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>>>> - huge page sizes controllable by the userspace?
> >>>>>>>
> >>>>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>>>> have better control of their applications.
> >>>>>>
> >>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>>>> They get a very good control over page size and pool preallocation etc.
> >>>>>> So they can get what they need - assuming there is enough memory.
> >>>>>>
> >>>>>
> >>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>>>> to support. I can understand that there are some use cases that might
> >>>>> benefit from it, especially:
> >>>>
> >>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >>>> that can transparently split under memory pressure is a useful
> >>>> funtionality. I cannot really judge how complex that would be
> >>>
> >>> Right, but that's then something different than serving (scarce,
> >>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> >>> wrong about *real* THP support, meaning, e.g., grouping consecutive
> >>> pages and converting them back and forth on demand. (E.g., 1GB ->
> >>> multiple 2MB -> multiple single pages), for example, when having to
> >>> migrate such a gigantic page. But that's very different from our
> >>> existing gigantic page code as far as I can tell.
> >>
> >> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> >> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> >> which needs section size increase. In addition, unmoveable pages cannot
> >> be allocated in CMA, so allocating 1GB pages has much higher chance from
> >> it than from ZONE_NORMAL.
> > 
> > s/higher chances/non-zero chances
> 
> Well, the longer the system runs (and consumes a significant amount of
> available main memory), the less likely it is.
> 
> > 
> > Currently we have nothing that prevents the fragmentation of the memory
> > with unmovable pages on the 1GB scale. It means that in a common case
> > it's highly unlikely to find a continuous GB without any unmovable page.
> > As now CMA seems to be the only working option.
> > 
> 
> And I completely dislike the use of CMA in this context (for example,
> allocating via CMA and freeing via the buddy by patching CMA when
> splitting up PUDs ...).
> 
> > However it seems there are other use cases for the allocation of continuous
> > 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> > 1GB pages can reduce the fragmentation of the direct mapping.
> 
> Yes, see RFC v1 where I already cced Mike.
> 
> > 
> > So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
> > E.g. something like a second level of pageblocks. That would allow to group
> > all unmovable memory in few 1GB blocks and have more 1GB regions available for
> > gigantic THPs and other use cases. I'm looking now into how it can be done.
> 
> Anything bigger than sections is somewhat problematic: you have to track
> that data somewhere. It cannot be the section (in contrast to pageblocks)

Well, it's not a large amount of data: the number of 1GB regions is not that
high even on very large machines.

> 
> > If anybody has any ideas here, I'll appreciate a lot.
> 
> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
> somewhat mimics what CMA does (when sized reasonably), works well with
> memory hot(un)plug, and is immune to misconfiguration. Within such a
> zone, we can try to optimize the placement of larger blocks.

Thank you for pointing at it!

The main problem with it is the same as with ZONE_MOVABLE: it does require
a boot-time educated guess on a good size. I admit that the CMA does too.

But I really hope that a long-term solution will not require any
pre-configuration. I do not see why we fundamentally can't group unmovable
allocations in a few 1GB regions. Basically, all we need to do is to choose
a nearby 2MB block when we don't have enough free pages on the unmovable
free list and are about to steal a new 2MB block. I know it doesn't work
exactly this way, but just as an illustration: in reality, when stealing a
block, under some conditions we might want to steal the whole 1GB region.
In that case the following unmovable allocations will not lead to stealing
new blocks from (potentially) different 1GB regions.
I have no working code yet, just thinking in this direction - a rough
sketch of the idea is below.
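
To make that direction a bit more concrete, here is a toy sketch of the
heuristic (user-space C with made-up structures, nothing to do with the
real free-list code; just how I imagine the block selection):

#include <stdbool.h>

#define BLOCKS_PER_GB_REGION 512        /* 2MB pageblocks per 1GB region */

struct gb_region {
	unsigned long first_block;      /* index of the region's first 2MB block */
	unsigned int  unmovable_blocks; /* 2MB blocks already used for unmovable */
};

/*
 * Pick a 2MB block to steal for an unmovable allocation: prefer a 1GB
 * region that already hosts unmovable blocks, and only open up a fresh
 * region when no such region has a free block left.
 */
long pick_block_to_steal(struct gb_region *regions, int nr_regions,
			 bool (*block_is_free)(unsigned long block))
{
	int pass, i;
	unsigned long b;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < nr_regions; i++) {
			/* Pass 0: only regions that are already partially unmovable. */
			if (pass == 0 && !regions[i].unmovable_blocks)
				continue;
			for (b = 0; b < BLOCKS_PER_GB_REGION; b++) {
				unsigned long block = regions[i].first_block + b;

				if (block_is_free(block)) {
					regions[i].unmovable_blocks++;
					return (long)block;
				}
			}
		}
	}
	return -1;      /* nothing stealable */
}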

Thanks!
David Hildenbrand Oct. 5, 2020, 6:33 p.m. UTC | #17
On 05.10.20 20:25, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
>> On 05.10.20 19:16, Roman Gushchin wrote:
>>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>>>
>>>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>>>
>>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>>>> have better control of their applications.
>>>>>>>>
>>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>>>
>>>>>>>
>>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>>>> to support. I can understand that there are some use cases that might
>>>>>>> benefit from it, especially:
>>>>>>
>>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>>>> that can transparently split under memory pressure is a useful
>>>>>> funtionality. I cannot really judge how complex that would be
>>>>>
>>>>> Right, but that's then something different than serving (scarce,
>>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>>>> multiple 2MB -> multiple single pages), for example, when having to
>>>>> migrate such a gigantic page. But that's very different from our
>>>>> existing gigantic page code as far as I can tell.
>>>>
>>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>>>> which needs section size increase. In addition, unmoveable pages cannot
>>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>>>> it than from ZONE_NORMAL.
>>>
>>> s/higher chances/non-zero chances
>>
>> Well, the longer the system runs (and consumes a significant amount of
>> available main memory), the less likely it is.
>>
>>>
>>> Currently we have nothing that prevents the fragmentation of the memory
>>> with unmovable pages on the 1GB scale. It means that in a common case
>>> it's highly unlikely to find a continuous GB without any unmovable page.
>>> As now CMA seems to be the only working option.
>>>
>>
>> And I completely dislike the use of CMA in this context (for example,
>> allocating via CMA and freeing via the buddy by patching CMA when
>> splitting up PUDs ...).
>>
>>> However it seems there are other use cases for the allocation of continuous
>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
>>> 1GB pages can reduce the fragmentation of the direct mapping.
>>
>> Yes, see RFC v1 where I already cced Mike.
>>
>>>
>>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
>>> E.g. something like a second level of pageblocks. That would allow to group
>>> all unmovable memory in few 1GB blocks and have more 1GB regions available for
>>> gigantic THPs and other use cases. I'm looking now into how it can be done.
>>
>> Anything bigger than sections is somewhat problematic: you have to track
>> that data somewhere. It cannot be the section (in contrast to pageblocks)
> 
> Well, it's not a large amount of data: the number of 1GB regions is not that
> high even on very large machines.

Yes, but then you can have very sparse systems. And some use cases would
actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
optimizing memory efficiency by turning off banks and such ...

> 
>>
>>> If anybody has any ideas here, I'll appreciate a lot.
>>
>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
>> somewhat mimics what CMA does (when sized reasonably), works well with
>> memory hot(un)plug, and is immune to misconfiguration. Within such a
>> zone, we can try to optimize the placement of larger blocks.
> 
> Thank you for pointing at it!
> 
> The main problem with it is the same as with ZONE_MOVABLE: it does require
> a boot-time educated guess on a good size. I admit that the CMA does too.

"Educated guess" of ratios like 1:1. 1:2, and even 1:4 (known from
highmem times) ares usually perfectly fine. And if you mess up - in
comparison to CMA - you won't shoot yourself in the foot, you get less
gigantic pages - which is usually better than before. I consider that a
clear win. Perfect? No. Can we be perfect? unlikely.

In comparison to CMA / ZONE_MOVABLE, a bad guess won't cause instabilities.
David Hildenbrand Oct. 5, 2020, 6:48 p.m. UTC | #18
> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If kernel provides similar interfaces to control page sizes
> of THPs, it should work the same as hugetlbfs. Mixing page sizes usually
> comes from system memory fragmentation and hugetlbfs does not have this
> mixture because of its special allocation pools not because of the code

With hugetlbfs, you have a guarantee that all pages within your VMA have
the same page size. This is an important property. With THP you have the
guarantee that any page can be operated on as if it were at base-page
granularity.

Example: KVM on s390x

a) It cannot deal with THP. If you supply THP, the kernel will simply
split up all THP and prohibit new ones from getting formed. All works
well (well, no speedup because no THP).
b) It can deal with 1MB huge pages (in some configurations).
c) It cannot deal with 2G huge pages.

So user space really has to control which pagesize to use in case of
hugetlbfs.

> itself. If THPs are allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

Did I mention that I dislike taking THP from the CMA pool? ;)

> 
> I just do not get why hugetlbfs is so special that it can have fine
> page-size control when normal pages cannot. The “it should be invisible
> to userspace” argument suddenly does not hold for hugetlbfs.

It's not about "cannot get", it's about "do we need it". We do have a
trigger "THP yes/no". I wonder in which cases that wouldn't be sufficient.


The name "Transparent" implies that they *should* be transparent to user
space. This, unfortunately, is not completely true:

1. Performance aspects: Breaking up THP is bad for performance. This can
be observed fairly easily when using 4k-based memory ballooning in
virtualized environments. If we stick to the current THP size (e.g.,
2MB), we are mostly fine. Breaking up 1G THP into 2MB THP when required
is completely acceptable.

2. Wasting memory: Touch a 4K page, get 2M populated. Somewhat
acceptable / controllable. Touching 4K and getting 1G populated is not
desirable. And I think we mostly agree that we should operate only on
fully-populated ranges when replacing them with a 1G THP.


But then, there is no observable difference between 1G THP and 2M THP
from a user-space point of view except performance.

So we are debating "Should the kernel tell us that we can use 1G
THP for a VMA?". What if we were suddenly to support 2G THP (look at
how arm64 supports all kinds of huge page sizes for hugetlbfs)? Do we
really need *another* trigger?

What Michal proposed (IIUC) is rather user space telling the kernel
"this large memory range here is *really* important for performance,
please try to optimize the memory layout, give me the best you've got".

MADV_HUGEPAGE_1GB is just ugly.
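
Just to spell out what we are comparing, a small user-space sketch
(MADV_HUGEPAGE is what exists today; MADV_HUGEPAGE_1GB is the advice this
series proposes, and the numeric value below is only a placeholder):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE_1GB
#define MADV_HUGEPAGE_1GB 67    /* placeholder value, not the real encoding */
#endif

void hint_thp(void *addr, size_t len)
{
	/* Today: "try to back this range with THP", size picked by the kernel. */
	madvise(addr, len, MADV_HUGEPAGE);

	/* Proposed by this series: an explicitly size-specific hint. */
	madvise(addr, len, MADV_HUGEPAGE_1GB);
}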
Roman Gushchin Oct. 5, 2020, 7:11 p.m. UTC | #19
On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote:
> On 05.10.20 20:25, Roman Gushchin wrote:
> > On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
> >> On 05.10.20 19:16, Roman Gushchin wrote:
> >>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> >>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> >>>>
> >>>>> On 02.10.20 10:10, Michal Hocko wrote:
> >>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>>>>>> - huge page sizes controllable by the userspace?
> >>>>>>>>>
> >>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>>>>>> have better control of their applications.
> >>>>>>>>
> >>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>>>>>> They get a very good control over page size and pool preallocation etc.
> >>>>>>>> So they can get what they need - assuming there is enough memory.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>>>>>> to support. I can understand that there are some use cases that might
> >>>>>>> benefit from it, especially:
> >>>>>>
> >>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >>>>>> that can transparently split under memory pressure is a useful
> >>>>>> funtionality. I cannot really judge how complex that would be
> >>>>>
> >>>>> Right, but that's then something different than serving (scarce,
> >>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> >>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
> >>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
> >>>>> multiple 2MB -> multiple single pages), for example, when having to
> >>>>> migrate such a gigantic page. But that's very different from our
> >>>>> existing gigantic page code as far as I can tell.
> >>>>
> >>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> >>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> >>>> which needs section size increase. In addition, unmoveable pages cannot
> >>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
> >>>> it than from ZONE_NORMAL.
> >>>
> >>> s/higher chances/non-zero chances
> >>
> >> Well, the longer the system runs (and consumes a significant amount of
> >> available main memory), the less likely it is.
> >>
> >>>
> >>> Currently we have nothing that prevents the fragmentation of the memory
> >>> with unmovable pages on the 1GB scale. It means that in a common case
> >>> it's highly unlikely to find a continuous GB without any unmovable page.
> >>> As now CMA seems to be the only working option.
> >>>
> >>
> >> And I completely dislike the use of CMA in this context (for example,
> >> allocating via CMA and freeing via the buddy by patching CMA when
> >> splitting up PUDs ...).
> >>
> >>> However it seems there are other use cases for the allocation of continuous
> >>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> >>> 1GB pages can reduce the fragmentation of the direct mapping.
> >>
> >> Yes, see RFC v1 where I already cced Mike.
> >>
> >>>
> >>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
> >>> E.g. something like a second level of pageblocks. That would allow to group
> >>> all unmovable memory in few 1GB blocks and have more 1GB regions available for
> >>> gigantic THPs and other use cases. I'm looking now into how it can be done.
> >>
> >> Anything bigger than sections is somewhat problematic: you have to track
> >> that data somewhere. It cannot be the section (in contrast to pageblocks)
> > 
> > Well, it's not a large amount of data: the number of 1GB regions is not that
> > high even on very large machines.
> 
> Yes, but then you can have very sparse systems. And some use cases would
> actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
> optimizing memory efficiency by turning off banks and such ...

It's definitely a good question.

> > 
> >>
> >>> If anybody has any ideas here, I'll appreciate a lot.
> >>
> >> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
> >> somewhat mimics what CMA does (when sized reasonably), works well with
> >> memory hot(un)plug, and is immune to misconfiguration. Within such a
> >> zone, we can try to optimize the placement of larger blocks.
> > 
> > Thank you for pointing at it!
> > 
> > The main problem with it is the same as with ZONE_MOVABLE: it does require
> > a boot-time educated guess on a good size. I admit that the CMA does too.
> 
> "Educated guess" of ratios like 1:1. 1:2, and even 1:4 (known from
> highmem times) ares usually perfectly fine. And if you mess up - in
> comparison to CMA - you won't shoot yourself in the foot, you get less
> gigantic pages - which is usually better than before. I consider that a
> clear win. Perfect? No. Can we be perfect? unlikely.

I'm not necessarily opposing your idea, I just think it will be tricky
not to introduce additional overhead if the ratio is not perfectly
chosen. And there is simply the cost of adding a zone.

But fundamentally we're speaking about the same thing: grouping pages
by their movability on a smaller scale. With a new zone we'd split
pages into two parts with a fixed border; with a new pageblock layer,
we'd group them in 1GB blocks.

I think the agreement is that we need such functionality.

Thanks!
Zi Yan Oct. 5, 2020, 7:12 p.m. UTC | #20
On 5 Oct 2020, at 11:55, Matthew Wilcox wrote:

> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>> Yes, I think one important feature would be that we don't end up placing
>>> a gigantic page where only a handful of pages are actually populated
>>> without green light from the application - because that's what some user
>>> space applications care about (not consuming more memory than intended.
>>> IIUC, this is also what this patch set does). I'm fine with placing
>>> gigantic pages if it really just "defragments" the address space layout,
>>> without filling unpopulated holes.
>>>
>>> Then, this would be mostly invisible to user space, and we really
>>> wouldn't have to care about any configuration.
>>
>> I agree that the interface should be as simple as no configuration to
>> most users. But I also wonder why we have hugetlbfs to allow users to
>> specify different kinds of page sizes, which seems against the discussion
>> above. Are we assuming advanced users should always use hugetlbfs instead
>> of THPs?
>
> Evolution doesn't always produce the best outcomes ;-)
>
> A perennial mistake we've made is "Oh, this is a strange & new & weird
> feature that most applications will never care about, let's put it in
> hugetlbfs where nobody will notice and we don't have to think about it
> in the core VM"
>
> And then what was initially strange & new & weird gradually becomes
> something that most applications just want to have happen automatically,
> and telling them all to go use hugetlbfs becomes untenable, so we move
> the feature into the core VM.
>
> It is absurd that my phone is attempting to manage a million 4kB pages.
> I think even trying to manage a quarter-million 16kB pages is too much
> work, and really it would be happier managing 65,000 64kB pages.
>
> Extend that into the future a decade or two, and we'll be expecting
> that it manages memory in megabyte sized units and uses PMD and PUD
> mappings by default.  PTE mappings will still be used, but very much
> on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into
> smaller pages to not waste too much memory when mapping it" basis.  So,
> yeah, PUD sized mappings have problems today, but we should be writing
> software now so a Pixel 15 in a decade can boot a kernel built five
> years from now and have PUD mappings Just Work without requiring the
> future userspace programmer to "use hugetlbfs".

I agree.

>
> One of the longer-term todo items is to support variable sized THPs for
> anonymous memory, just like I've done for the pagecache.  With that in
> place, I think scaling up from PMD sized pages to PUD sized pages starts
> to look more natural.  Itanium and PA-RISC (two architectures that will
> never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> The RiscV spec you pointed me at the other day confines itself to adding
> support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> sizes would be possible additions in the future.

Just to understand the todo items clearly: with your pagecache patchset,
the kernel should be able to understand variable sized THPs whether they
are anonymous or not, right? For anonymous memory, we need kernel policies
to decide what THP sizes to use at allocation, what to do under memory
pressure, and so on. In terms of implementation, the THP split function
needs to support splitting from any order to any lower order. Anything I am missing here?
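
(To illustrate what I mean by "any order to any lower order", a
stand-alone sketch of the order arithmetic with x86_64 orders assumed;
the constants are illustrative, not kernel code:)

#include <stdio.h>

#define PMD_THP_ORDER 9         /* 2MB with 4kB base pages (x86_64) */
#define PUD_THP_ORDER 18        /* 1GB with 4kB base pages (x86_64) */

/* An order-`from` compound page splits into 2^(from - to) order-`to` pages. */
static unsigned long pieces_after_split(unsigned int from, unsigned int to)
{
	return 1UL << (from - to);
}

int main(void)
{
	printf("PUD -> PMD : %lu pieces\n",
	       pieces_after_split(PUD_THP_ORDER, PMD_THP_ORDER));      /* 512 */
	printf("PUD -> base: %lu pieces\n",
	       pieces_after_split(PUD_THP_ORDER, 0));                   /* 262144 */
	printf("PMD -> base: %lu pieces\n",
	       pieces_after_split(PMD_THP_ORDER, 0));                   /* 512 */
	return 0;
}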

>
> But, back to today, what to do with this patchset?  Even on my 16GB
> laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB
> page is ever the right decision to make.  But my laptop runs a "mixed"
> workload, and if you could convince me that Firefox would run 10% faster
> by using a 1GB page as its in-memory cache, well, I'd be sold.
>
> I do like having the kernel figure out what's in the best interests of the
> system as a whole.  Apps don't have enough information, and while they
> can provide hints, they're often wrong.  So, let's say an app maps 8GB
> of anonymous memory.  As the app accesses it, we should probably start
> by allocating 4kB pages to back that memory.  As time goes on and that
> memory continues to be accessed and more memory is accessed, it makes
> sense to keep track of that, replacing the existing 4kB pages with, say,
> 16-64kB pages and allocating newly accessed memory with larger pages.
> Eventually that should grow to 2MB allocations and PMD mappings.
> And then continue on, all the way to 1GB pages.
>
> We also need to be able to figure out that it's not being effective
> any more.  One of the issues with tracing accessed/dirty at the 1GB level
> is that writing an entire 1GB page is going to take 0.25 seconds on a x4
> gen3 PCIe link.  I know swapping sucks, but that's extreme.  So to use
> 1GB pages effectively today, we need to fragment them before choosing to
> swap them out (*)  Maybe that's the point where we can start to say "OK,
> this sized mapping might not be effective any more".  On the other hand,
> that might not work for some situations.  Imagine, eg, a matrix multiply
> (everybody's favourite worst-case scenario).  C = A * B where each of A,
> B and C is too large to fit in DRAM.  There are going to be points of the
> calculation where each element of A is going to be walked sequentially,
> and so it'd be nice to use larger PTEs to map it, but then we need to
> destroy that almost immediately to allow other things to use the memory.
>
>
> I think I'm leaning towards not merging this patchset yet.  I'm in
> agreement with the goals (allowing systems to use PUD-sized pages
> automatically), but I think we need to improve the infrastructure to
> make it work well automatically.  Does that make sense?

I agree that this patchset should not be merged in its current form.
I think PUD THP support is part of variable sized THP support, but the
current form of the patchset does not have the “variable sized THP”
spirit yet and is more like special-case PUD support. I guess some
changes to the existing THP code to make PUD THP less of a special case
would make the whole patchset more acceptable?

Can you elaborate more on the infrastructure part? Thanks.

>
> (*) It would be nice if hardware provided a way to track D/A on a sub-PTE
> level when using PMD/PUD sized mappings.  I don't know of any that does
> that today.

I agree it would be a nice hardware feature, but it also has a high cost.
Each TLB entry would need 1024 bits to support this, which is about 16
times the size of a TLB entry, assuming each entry takes 8B of space (see
the back-of-the-envelope sketch below). Then the question becomes: why not
just have a bigger TLB? ;)
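
(Back-of-the-envelope for the numbers above, assuming 2 bits - accessed +
dirty - per 4kB subpage of a 2MB entry:)

#include <stdio.h>

int main(void)
{
	unsigned long subpages = (2UL << 20) / (4UL << 10);     /* 512 4kB subpages */
	unsigned long bits     = subpages * 2;                  /* A + D per subpage */
	unsigned long entries  = bits / 8 / 8;                  /* vs. 8B TLB entries */

	printf("%lu subpages -> %lu bits -> %lu TLB-entry sizes\n",
	       subpages, bits, entries);                        /* 512, 1024, 16 */
	return 0;
}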



—
Best Regards,
Yan Zi
Matthew Wilcox (Oracle) Oct. 5, 2020, 7:37 p.m. UTC | #21
On Mon, Oct 05, 2020 at 03:12:55PM -0400, Zi Yan wrote:
> On 5 Oct 2020, at 11:55, Matthew Wilcox wrote:
> > One of the longer-term todo items is to support variable sized THPs for
> > anonymous memory, just like I've done for the pagecache.  With that in
> > place, I think scaling up from PMD sized pages to PUD sized pages starts
> > to look more natural.  Itanium and PA-RISC (two architectures that will
> > never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> > The RiscV spec you pointed me at the other day confines itself to adding
> > support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> > sizes would be possible additions in the future.
> 
> Just to understand the todo items clearly. With your pagecache patchset,
> kernel should be able to understand variable sized THPs no matter they
> are anonymous or not, right?

... yes ... modulo bugs and places I didn't fix because only anonymous
pages can get there ;-)  There are still quite a few references to
HPAGE_PMD_MASK / SIZE / NR and I couldn't swear that they're all related
to things which are actually PMD sized.  I did fix a couple of places
where the anonymous path assumed that pages were PMD sized because I
thought we'd probably want to do that sooner rather than later.

> For anonymous memory, we need kernel policies
> to decide what THP sizes to use at allocation, what to do when under
> memory pressure, and so on. In terms of implementation, THP split function
> needs to support from any order to any lower order. Anything I am missing here?

I think that's the bulk of the work.  The swap code also needs work so we
don't have to split pages to swap them out.

> > I think I'm leaning towards not merging this patchset yet.  I'm in
> > agreement with the goals (allowing systems to use PUD-sized pages
> > automatically), but I think we need to improve the infrastructure to
> > make it work well automatically.  Does that make sense?
> 
> I agree that this patchset should not be merged in the current form.
> I think PUD THP support is a part of variable sized THP support, but
> current form of the patchset does not have the “variable sized THP”
> spirit yet and is more like a special PUD case support. I guess some
> changes to existing THP code to make PUD THP less a special case would
> make the whole patchset more acceptable?
> 
> Can you elaborate more on the infrastructure part? Thanks.

Oh, this paragraph was just summarising the above.  We need to
be consistently using thp_size() instead of HPAGE_PMD_SIZE, etc.
I haven't put much effort yet into supporting pages which are larger than
PMD-size -- that is, if a page is mapped with a PMD entry, we assume
it's PMD-sized.  Once we can allocate a larger-than-PMD sized page,
that's off.  I assume a lot of that is dealt with in your patchset,
although I haven't audited it to check for that.

> > (*) It would be nice if hardware provided a way to track D/A on a sub-PTE
> > level when using PMD/PUD sized mappings.  I don't know of any that does
> > that today.
> 
> I agree it would be a nice hardware feature, but it also has a high cost.
> Each TLB would support this with 1024 bits, which is about 16 TLB entry size,
> assuming each entry takes 8B space. Now it becomes why not having a bigger
> TLB. ;)

Oh, we don't have to track at the individual-page level for this to be
useful.  Let's take the RISC-V Sv39 page table entry format as an example:

63-54 attributes
53-28 PPN2
27-19 PPN1
18-10 PPN0
9-8 RSW
7-0 DAGUXWRV

For a 2MB page, we currently insist that 18-10 are zero.  If we repurpose
eight of those nine bits as A/D bits, we can track at 512kB granularity.
For 1GB pages, we can use 16 of the 18 bits to track A/D at 128MB
granularity.  It's not great, but it is quite cheap!
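
(The same arithmetic as a tiny user-space sketch; splitting the spare
bits evenly between accessed and dirty halves is my assumption of how
they would be used:)

#include <stdio.h>

/* Half the spare PPN bits track "accessed", the other half "dirty". */
static unsigned long sub_pte_granularity(unsigned long page_size,
					 unsigned int spare_bits_used)
{
	return page_size / (spare_bits_used / 2);
}

int main(void)
{
	printf("2MB leaf, 8 spare bits : %lu kB granularity\n",
	       sub_pte_granularity(2UL << 20, 8) >> 10);        /* 512 kB */
	printf("1GB leaf, 16 spare bits: %lu MB granularity\n",
	       sub_pte_granularity(1UL << 30, 16) >> 20);       /* 128 MB */
	return 0;
}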
David Hildenbrand Oct. 6, 2020, 8:25 a.m. UTC | #22
On 05.10.20 21:11, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote:
>> On 05.10.20 20:25, Roman Gushchin wrote:
>>> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
>>>> On 05.10.20 19:16, Roman Gushchin wrote:
>>>>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>>>>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>>>>>
>>>>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>>>>>> have better control of their applications.
>>>>>>>>>>
>>>>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>>>>>> to support. I can understand that there are some use cases that might
>>>>>>>>> benefit from it, especially:
>>>>>>>>
>>>>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>>>>>> that can transparently split under memory pressure is a useful
>>>>>>>> funtionality. I cannot really judge how complex that would be
>>>>>>>
>>>>>>> Right, but that's then something different than serving (scarce,
>>>>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>>>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>>>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>>>>>> multiple 2MB -> multiple single pages), for example, when having to
>>>>>>> migrate such a gigantic page. But that's very different from our
>>>>>>> existing gigantic page code as far as I can tell.
>>>>>>
>>>>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>>>>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>>>>>> which needs section size increase. In addition, unmoveable pages cannot
>>>>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>>>>>> it than from ZONE_NORMAL.
>>>>>
>>>>> s/higher chances/non-zero chances
>>>>
>>>> Well, the longer the system runs (and consumes a significant amount of
>>>> available main memory), the less likely it is.
>>>>
>>>>>
>>>>> Currently we have nothing that prevents the fragmentation of the memory
>>>>> with unmovable pages on the 1GB scale. It means that in a common case
>>>>> it's highly unlikely to find a continuous GB without any unmovable page.
>>>>> As now CMA seems to be the only working option.
>>>>>
>>>>
>>>> And I completely dislike the use of CMA in this context (for example,
>>>> allocating via CMA and freeing via the buddy by patching CMA when
>>>> splitting up PUDs ...).
>>>>
>>>>> However it seems there are other use cases for the allocation of continuous
>>>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
>>>>> 1GB pages can reduce the fragmentation of the direct mapping.
>>>>
>>>> Yes, see RFC v1 where I already cced Mike.
>>>>
>>>>>
>>>>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
>>>>> E.g. something like a second level of pageblocks. That would allow to group
>>>>> all unmovable memory in few 1GB blocks and have more 1GB regions available for
>>>>> gigantic THPs and other use cases. I'm looking now into how it can be done.
>>>>
>>>> Anything bigger than sections is somewhat problematic: you have to track
>>>> that data somewhere. It cannot be the section (in contrast to pageblocks)
>>>
>>> Well, it's not a large amount of data: the number of 1GB regions is not that
>>> high even on very large machines.
>>
>> Yes, but then you can have very sparse systems. And some use cases would
>> actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
>> optimizing memory efficiency by turning off banks and such ...
> 
> It's definitely a good question.

Oh, and I forgot that there might be users that want bigger granularity
:) (primarily, memory hotunplug that wants to avoid ZONE_MOVABLE but
still wants higher chances of eventually unplugging some memory)

> 
>>>
>>>>
>>>>> If anybody has any ideas here, I'll appreciate a lot.
>>>>
>>>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
>>>> somewhat mimics what CMA does (when sized reasonably), works well with
>>>> memory hot(un)plug, and is immune to misconfiguration. Within such a
>>>> zone, we can try to optimize the placement of larger blocks.
>>>
>>> Thank you for pointing at it!
>>>
>>> The main problem with it is the same as with ZONE_MOVABLE: it does require
>>> a boot-time educated guess on a good size. I admit that the CMA does too.
>>
>> "Educated guess" of ratios like 1:1. 1:2, and even 1:4 (known from
>> highmem times) ares usually perfectly fine. And if you mess up - in
>> comparison to CMA - you won't shoot yourself in the foot, you get less
>> gigantic pages - which is usually better than before. I consider that a
>> clear win. Perfect? No. Can we be perfect? unlikely.
> 
> I'm not necessarily opposing your idea, I just think it will be tricky
> not to introduce additional overhead if the ratio is not perfectly
> chosen. And there is simply the cost of adding a zone.

Not sure this will be really visible - and if your kernel requires more
than 20%..50% unmovable data then something is usually really
fishy/special. The nice thing is that Linux will try to "auto-optimize"
within each zone already.

My gut feeling is that it's way easier to teach Linux (add zone, add
mmop_type, build zonelists, split memory similar to movablecore) -
however, that doesn't imply that it's better. We'll have to see.

> 
> But fundamentally we're speaking about the same thing: grouping pages
> by their movability on a smaller scale. With a new zone we'll split
> pages into two parts with a fixed border, with new pageblock layer
> in 1GB blocks.

I also discussed moving the border on demand, which is way more tricky
and would definitely be stuff for the future.

There are some papers about similar fragmentation-avoidance techniques,
mostly in the context of energy efficiency IIRC. Especially:
- PALLOC: https://ieeexplore.ieee.org/document/6925999
- Adaptive-buddy:
https://ieeexplore.ieee.org/document/7397629?reload=true&arnumber=7397629

IIRC, the problem with such approaches is that they are quite invasive
and degrade some workloads due to overhead.

> 
> I think the agreement is that we need such functionality.

Yeah, it's on my long todo list. I'll be prototyping ZONE_PREFER_MOVABLE
soon, to see how it looks/feels/performs.
Michal Hocko Oct. 6, 2020, 11:59 a.m. UTC | #23
On Mon 05-10-20 14:05:17, Zi Yan wrote:
> On 5 Oct 2020, at 13:39, David Hildenbrand wrote:
> 
> >>>> consideting that 2MB THP have turned out to be quite a pain but
> >>>> situation has settled over time. Maybe our current code base is prepared
> >>>> for that much better.
> >>
> >> I am planning to refactor my code further to reduce the amount of
> >> the added code, since PUD THP is very similar to PMD THP. One thing
> >> I want to achieve is to enable split_huge_page to split any order of
> >> pages to a group of any lower order of pages. A lot of code in this
> >> patchset is replicating the same behavior of PMD THP at PUD level.
> >> It might be possible to deduplicate most of the code.
> >>
> >>>>
> >>>> Exposing that interface to the userspace is a different story of course.
> >>>> I do agree that we likely do not want to be very explicit about that.
> >>>> E.g. an interface for address space defragmentation without any more
> >>>> specifics sounds like a useful feature to me. It will be up to the
> >>>> kernel to decide which huge pages to use.
> >>>
> >>> Yes, I think one important feature would be that we don't end up placing
> >>> a gigantic page where only a handful of pages are actually populated
> >>> without green light from the application - because that's what some user
> >>> space applications care about (not consuming more memory than intended.
> >>> IIUC, this is also what this patch set does). I'm fine with placing
> >>> gigantic pages if it really just "defragments" the address space layout,
> >>> without filling unpopulated holes.
> >>>
> >>> Then, this would be mostly invisible to user space, and we really
> >>> wouldn't have to care about any configuration.
> >>
> >>
> >> I agree that the interface should be as simple as no configuration to
> >> most users. But I also wonder why we have hugetlbfs to allow users to
> >> specify different kinds of page sizes, which seems against the discussion
> >> above. Are we assuming advanced users should always use hugetlbfs instead
> >> of THPs?
> >
> > Well, with hugetlbfs you get a real control over which pagesizes to use.
> > No mixture, guarantees.
> >
> > In some environments you might want to control which application gets
> > which pagesize. I know of database applications and hypervisors that
> > sometimes really want 2MB huge pages instead of 1GB huge pages. And
> > sometimes you really want/need 1GB huge pages (e.g., low-latency
> > applications, real-time KVM, ...).
> >
> > Simple example: KVM with postcopy live migration
> >
> > While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
> > on demand (via userfaultfd) is painfully slow / impractical.
> 
> 
> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If kernel provides similar interfaces to control page sizes
> of THPs, it should work the same as hugetlbfs. Mixing page sizes usually
> comes from system memory fragmentation and hugetlbfs does not have this
> mixture because of its special allocation pools not because of the code
> itself.

Not really. Hugetlb is defined to provide consistent, single-page-size
access to the memory - to the degree that you fail early if you cannot
guarantee that. This is not an implementation detail. This is the
semantics of the feature. Control goes along with the interface.

> If THPs are allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

THPs are a completely different beast. They aim to be transparent
so that the user doesn't really have to control them explicitly[1]. They
should be dynamically created and demoted as the system manages resources
behind the user's back. In short, they optimize rather than guarantee. This
is also the reason why complete control sounds quite alien to me. Say you
explicitly ask for THP_SIZEFOO but the kernel decides on a completely
different size later on. What is the actual contract you as a user are
getting?

In an ideal world the kernel would pick the best large page size
automagically. I am a bit skeptical we will reach such enlightenment
soon (if ever), so a certain level of hinting is likely needed to prevent
a repeat of the 2MB THP fiasco [1]. But the control should correspond to
the functionality users are getting.

> I just do not get why hugetlbfs is so special that it can have fine
> page-size control when normal pages cannot. The “it should be invisible
> to userspace” argument suddenly does not hold for hugetlbfs.

In short it provides a guarantee.

Does the above clarify it a bit?


[1] This is not entirely true, though, because there is a non-trivial
admin interface around THP. Mostly because they turned out to be too
transparent and many people do care about internal fragmentation,
allocation latency, locality (a small page on a local node or a large one
on a slightly more distant node?) or simply follow a cargo cult - just
have a look at how many admin guides recommend disabling THPs. We got
seriously burned by 2MB THP because of the way they were forced on users.