[v3,0/2] attempt to map anonymous pte-mapped THPs by pmds

Message ID cover.1702882426.git.xuyu@linux.alibaba.com (mailing list archive)

Message

Xu Yu Dec. 18, 2023, 7:06 a.m. UTC
Result of tools/testing/selftests/mm/cow.c tests:
# [RUN] Basic COW after fork() when collapsing before fork()
ok 145 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (fully shared)
ok 146 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (lower shared)
ok 147 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (upper shared)
ok 148 No leak from parent into child

A long run (w/ CONFIG_DEBUG_VM enabled) shows no panic or memory leaks.

Changes since v2:
- Use folios in the new code, as suggested by David.
- Handle folio refcount and rmap properly, as suggested by David.
- Minor modifications include: 1) advance vma write lock, 2) remove
  redundant rollback logic, 3) clear old ptes in pgtable before deposit.

Changes since v1:
- Deal with PageAnonExclusive properly, as suggested by David.

Xu Yu (2):
  mm/khugepaged: map RO non-exclusive pte-mapped anon THPs by pmds
  mm/khugepaged: map exclusive anonymous pte-mapped THPs by pmds

 mm/khugepaged.c | 229 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 229 insertions(+)

Comments

Zach O'Keefe Dec. 21, 2023, 8:40 p.m. UTC | #1
Hey Xu,

Thanks for the patches.

As a precursor, can you help me understand what the use case is for these
patches? In-place collapse of anon memory is something I've thought
about before, but the opportunity has never been especially clear.

In particular, your patches take an order-9 compound page, and just
try to see if we can update the mappings to it (like we do with
file/shmem). Functionally this seems fine, but the difference is that
with file/shmem, it's quite easy to have a pte-mapped-hugepage arise
naturally (the formation of the hugepage happening in the pagecache
being logically separate from the pmd-mapping of whatever task is mapping
it).

For anonymous memory, the only time I can see us having a pte-mapped
hugepage (that isn't destined for splitting on deferred split list)
that we want to remap by a pmd is if we cause a VMA split + remerge by
mucking with VMA attributes.

In my mind, what I had been thinking of w.r.t in-place anon collapse
was for the case where we've split a THP with MADV_FREE/MADV_DONTNEED
(i.e. to subrelease pages back to kernel), but later want to reform
the THP. In particular, if, for example, we only subrelease O(10s) of
order-0 pages, it seems wasteful to have to reallocate a fresh
hugepage, then copy over O(100s) of pages, on collapse. If we were
able to attempt to first migrate-away any of those previously
subreleased pages (now possibly backing some other memory entirely),
it could save us from having to allocate a fresh order-9 page. Under
memory pressure / fragmentation, this could mean the difference
between success and failure.
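
To make the arithmetic above concrete, here is a minimal userspace sketch
of the subrelease scenario. It is an illustration only, not part of the
patches: the helper names (pages_to_copy, subrelease_demo) are made up,
it assumes 4 KiB base pages and a 2 MiB PMD size (order-9 = 512 pages),
and it does not check whether the kernel really backed the range with a
THP. It only shows that after releasing N pages with MADV_DONTNEED, the
remaining 512 - N pages still hold data that a copy-based collapse would
have to move.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define BASE_PAGE_SIZE 4096UL              /* assumption: 4 KiB base pages */
#define HPAGE_NR       512L                /* order-9: 512 base pages */
#define HPAGE_SIZE     (HPAGE_NR * BASE_PAGE_SIZE) /* 2 MiB */

/* Pages a copy-based collapse would still have to copy after a subrelease. */
static long pages_to_copy(long released)
{
	return HPAGE_NR - released;
}

/*
 * Hypothetical demo helper: fault in one PMD-sized anonymous range,
 * subrelease its first `released` base pages with MADV_DONTNEED, and
 * count how many pages still hold their data. Returns -1 on error.
 */
static long subrelease_demo(long released)
{
	size_t len = 2 * HPAGE_SIZE; /* over-allocate to find an aligned window */
	unsigned char *raw, *huge;
	long i, kept = 0;

	if (released < 0 || released > HPAGE_NR)
		return -1;

	raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	/* Round up to the next 2 MiB boundary inside the mapping. */
	huge = (unsigned char *)(((uintptr_t)raw + HPAGE_SIZE - 1) &
				 ~(uintptr_t)(HPAGE_SIZE - 1));

#ifdef MADV_HUGEPAGE
	/* Best effort: ask for a THP; failure is not fatal for this demo. */
	(void)madvise(huge, HPAGE_SIZE, MADV_HUGEPAGE);
#endif
	memset(huge, 0xAB, HPAGE_SIZE); /* fault everything in */

	/* Subrelease: the released pages read back as zero afterwards. */
	if (released > 0 &&
	    madvise(huge, (size_t)released * BASE_PAGE_SIZE, MADV_DONTNEED) != 0) {
		munmap(raw, len);
		return -1;
	}

	for (i = 0; i < HPAGE_NR; i++)
		if (huge[i * BASE_PAGE_SIZE] == 0xAB)
			kept++;

	munmap(raw, len);
	return kept;
}
```

Calling subrelease_demo(16) from a small main() models the "O(10s)
released, O(100s) left to copy" case described above.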

Thanks for your help here,
Zach

David Hildenbrand Dec. 21, 2023, 8:54 p.m. UTC | #2
On 21.12.23 21:40, Zach O'Keefe wrote:
> Hey Xu,
> 
> Thanks for the patches.
> 
> As a precursor, can you help me understand what the use case is for these
> patches? In-place collapse of anon memory is something I've thought
> about before, but the opportunity has never been especially clear.
> 
> In particular, your patches take an order-9 compound page, and just
> try to see if we can update the mappings to it (like we do with
> file/shmem). Functionally this seems fine, but the difference is that
> with file/shmem, it's quite easy to have a pte-mapped-hugepage arise
> naturally (the formation of the hugepage happening in the pagecache
> being logically separate from the pmd-mapping of whatever task is mapping
> it).
> 
> For anonymous memory, the only time I can see us having a pte-mapped
> hugepage (that isn't destined for splitting on deferred split list)
> that we want to remap by a pmd is if we cause a VMA split + remerge by
> mucking with VMA attributes.

Yes, mostly because of madvise(), mprotect(), mremap(). But it also
happens right now when a THP is put into the swap cache: when
refaulting, you get a PTE-mapped THP.
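
A minimal userspace sketch of the mprotect() case, for illustration
only. The helper name split_and_remerge_vma is made up, the 2 MiB PMD
size and 4 KiB base page size are assumptions, and the code does not
verify that the range was actually backed by a THP. The idea: changing
the protection of one base page in the middle of a PMD-mapped THP forces
a VMA split and a PMD split; restoring the protection lets the VMAs
merge again, but the folio is expected to stay PTE-mapped.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define BASE_PAGE_SIZE 4096UL      /* assumption: 4 KiB base pages */
#define HPAGE_SIZE     (2UL << 20) /* assumption: 2 MiB PMD size */

/*
 * Hypothetical demo helper. Returns 0 if every call succeeded and the
 * data survived the split + remerge, -1 otherwise.
 */
static int split_and_remerge_vma(void)
{
	size_t len = 2 * HPAGE_SIZE; /* over-allocate to find an aligned window */
	unsigned char *raw, *huge, *mid;
	int ret = -1;

	raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	/* Round up to the next 2 MiB boundary inside the mapping. */
	huge = (unsigned char *)(((uintptr_t)raw + HPAGE_SIZE - 1) &
				 ~(uintptr_t)(HPAGE_SIZE - 1));
	mid = huge + HPAGE_SIZE / 2;

#ifdef MADV_HUGEPAGE
	/* Best effort: ask for a THP; failure is not fatal for this demo. */
	(void)madvise(huge, HPAGE_SIZE, MADV_HUGEPAGE);
#endif
	memset(huge, 0x5A, HPAGE_SIZE); /* fault the whole range in */

	/* Step 1: a sub-range protection change splits the VMA (and the PMD). */
	if (mprotect(mid, BASE_PAGE_SIZE, PROT_READ) != 0)
		goto out;

	/* Step 2: restoring the protection makes the VMAs mergeable again. */
	if (mprotect(mid, BASE_PAGE_SIZE, PROT_READ | PROT_WRITE) != 0)
		goto out;

	/* The data is untouched; only the way it is mapped has changed. */
	if (huge[0] == 0x5A && mid[0] == 0x5A && huge[HPAGE_SIZE - 1] == 0x5A)
		ret = 0;
out:
	munmap(raw, len);
	return ret;
}
```

On a THP-enabled system, the AnonHugePages value for this range in
/proc/self/smaps should show whether it is still PMD-mapped after
step 2.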

There are some other odd cases, and there might be more in the future
(see below).

> 
> In my mind, what I had been thinking of w.r.t in-place anon collapse
> was for the case where we've split a THP with MADV_FREE/MADV_DONTNEED
> (i.e. to subrelease pages back to kernel), but later want to reform
> the THP. In particular, if, for example, we only subrelease O(10s) of

Right, and in-place collapse even works if the folio has been pinned, 
which is nice.

> order-0 pages, it seems wasteful to have to reallocate a fresh
> hugepage, then copy over O(100s) of pages, on collapse. If we were
> able to attempt to first migrate-away any of those previously
> subreleased pages (now possibly backing some other memory entirely),
> it could save us from having to allocate a fresh order-9 page. Under
> memory pressure / fragmentation, this could mean the difference
> between success and failure.
> 

One thing that popped up a couple of times already is that we might want 
to PTE-map a PMD-sized THP for a couple of reasons (IIRC, FreeBSD does 
some of that). For example:

* Lazily zero the pages of the folio on demand, keeping all non-zeroed
   parts protnone. At a certain time (e.g., all zeroed), simply remap
   using a PMD.
* Detecting sub-page access by temporarily mapping the THP using PTEs.

Maybe also some uffd optimizations, whereby protnone parts are not
faulted in yet.