mbox series

[v3,00/15] Free user PTE page table pages

Message ID 20211110105428.32458-1-zhengqi.arch@bytedance.com (mailing list archive)
Headers show
Series Free user PTE page table pages | expand

Message

Qi Zheng Nov. 10, 2021, 10:54 a.m. UTC
Hi,

This patch series aims to free user PTE page table pages when all PTE entries
are empty.

The beginning of this story is that some malloc libraries(e.g. jemalloc or
tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those VAs.
They will use madvise(MADV_DONTNEED) to free physical memory if they want.
But the page tables do not be freed by madvise(), so it can produce many
page tables when the process touches an enormous virtual address space.

The following figures are a memory usage snapshot of one process which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

As we can see, the PTE page tables size is 110g, while the RES is 590g. In
theory, the process only need 1.2g PTE page tables to map those physical
memory. The reason why PTE page tables occupy a lot of memory is that
madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
doesn't free the PTE page table pages. So we can free those empty PTE page
tables to save memory. In the above cases, we can save memory about 108g(best
case). And the larger the difference between the size of VIRT and RES, the
more memory we save.

In this patch series, we add a pte_refcount field to the struct page of page
table to track how many users of PTE page table. Similar to the mechanism of
page refcount, the user of PTE page table should hold a refcount to it before
accessing. The PTE page table page will be freed when the last refcount is
dropped.

Testing:

The following code snippet can show the effect of optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.6 MB
VmPTE                102640 kB                         248 kB

I also have tested the stability by LTP[1] for several weeks. I have not seen
any crash so far.

The performance of page fault can be affected because of the allocation/freeing
of PTE page table pages. The following is the test result by using a micro
benchmark[2]:

root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

threads         before (pf/min)                     after (pf/min)
    1                32,085,255                         31,880,833 (-0.64%)
    8               101,674,967                        100,588,311 (-1.17%)
   16               113,207,000                        112,801,832 (-0.36%)

(The "pfn/min" means how many page faults in one minute.)

The performance of page fault is ~1% slower than before.

And there are no obvious changes in perf hot spots:

before:
  19.29%  [kernel]  [k] clear_page_rep
  16.12%  [kernel]  [k] do_user_addr_fault
   9.57%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.16%  [kernel]  [k] get_page_from_freelist
   5.03%  [kernel]  [k] __handle_mm_fault
   3.53%  [kernel]  [k] __rcu_read_unlock
   3.45%  [kernel]  [k] handle_mm_fault
   3.38%  [kernel]  [k] down_read_trylock
   2.74%  [kernel]  [k] free_unref_page_list
   2.17%  [kernel]  [k] up_read
   1.93%  [kernel]  [k] charge_memcg
   1.73%  [kernel]  [k] try_charge_memcg
   1.71%  [kernel]  [k] __alloc_pages
   1.69%  [kernel]  [k] ___perf_sw_event
   1.44%  [kernel]  [k] get_mem_cgroup_from_mm

after:
  18.19%  [kernel]  [k] clear_page_rep
  16.28%  [kernel]  [k] do_user_addr_fault
   8.39%  [kernel]  [k] _raw_spin_unlock_irqrestore
   5.12%  [kernel]  [k] get_page_from_freelist
   4.81%  [kernel]  [k] __handle_mm_fault
   4.68%  [kernel]  [k] down_read_trylock
   3.80%  [kernel]  [k] handle_mm_fault
   3.59%  [kernel]  [k] get_mem_cgroup_from_mm
   2.49%  [kernel]  [k] free_unref_page_list
   2.41%  [kernel]  [k] up_read
   2.16%  [kernel]  [k] charge_memcg
   1.92%  [kernel]  [k] __rcu_read_unlock
   1.88%  [kernel]  [k] ___perf_sw_event
   1.70%  [kernel]  [k] pte_get_unless_zero

This series is based on next-20211108.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c

Changelog in v2 -> v3:
 - Refactored this patch series:
        - [PATCH v3 6/15]: Introduce the new dummy helpers first
        - [PATCH v3 7-12/15]: Convert each subsystem individually
        - [PATCH v3 13/15]: Implement the actual logic to the dummy helpers
   And thanks for the advice from David and Jason.
 - Add a document.

Changelog in v1 -> v2:
 - Change pte_install() to pmd_install().
 - Fix some typo and code style problems.
 - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9],[PATCH v2 6/9]
   and [PATCH v2 7/9].

Qi Zheng (15):
  mm: do code cleanups to filemap_map_pmd()
  mm: introduce is_huge_pmd() helper
  mm: move pte_offset_map_lock() to pgtable.h
  mm: rework the parameter of lock_page_or_retry()
  mm: add pmd_installed_type return for __pte_alloc() and other friends
  mm: introduce refcount for user PTE page table page
  mm/pte_ref: add support for user PTE page table page allocation
  mm/pte_ref: initialize the refcount of the withdrawn PTE page table
    page
  mm/pte_ref: add support for the map/unmap of user PTE page table page
  mm/pte_ref: add support for page fault path
  mm/pte_ref: take a refcount before accessing the PTE page table page
  mm/pte_ref: update the pmd entry in move_normal_pmd()
  mm/pte_ref: free user PTE page table pages
  Documentation: add document for pte_ref
  mm/pte_ref: use mmu_gather to free PTE page table pages

 Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
 arch/x86/Kconfig             |   2 +-
 fs/proc/task_mmu.c           |  24 +++-
 fs/userfaultfd.c             |   9 +-
 include/linux/huge_mm.h      |  10 +-
 include/linux/mm.h           | 170 ++++-------------------------
 include/linux/mm_types.h     |   6 +-
 include/linux/pagemap.h      |   8 +-
 include/linux/pgtable.h      | 152 +++++++++++++++++++++++++-
 include/linux/pte_ref.h      | 146 +++++++++++++++++++++++++
 include/linux/rmap.h         |   2 +
 kernel/events/uprobes.c      |   2 +
 mm/Kconfig                   |   4 +
 mm/Makefile                  |   4 +-
 mm/damon/vaddr.c             |  12 +-
 mm/debug_vm_pgtable.c        |   5 +-
 mm/filemap.c                 |  45 +++++---
 mm/gup.c                     |  25 ++++-
 mm/hmm.c                     |   5 +-
 mm/huge_memory.c             |   3 +-
 mm/internal.h                |   4 +-
 mm/khugepaged.c              |  21 +++-
 mm/ksm.c                     |   6 +-
 mm/madvise.c                 |  21 +++-
 mm/memcontrol.c              |  12 +-
 mm/memory-failure.c          |  11 +-
 mm/memory.c                  | 254 ++++++++++++++++++++++++++++++++-----------
 mm/mempolicy.c               |   6 +-
 mm/migrate.c                 |  54 ++++-----
 mm/mincore.c                 |   7 +-
 mm/mlock.c                   |   1 +
 mm/mmu_gather.c              |  40 +++----
 mm/mprotect.c                |  11 +-
 mm/mremap.c                  |  14 ++-
 mm/page_vma_mapped.c         |   4 +
 mm/pagewalk.c                |  15 ++-
 mm/pgtable-generic.c         |   1 +
 mm/pte_ref.c                 | 141 ++++++++++++++++++++++++
 mm/rmap.c                    |  10 ++
 mm/swapfile.c                |   3 +
 mm/userfaultfd.c             |  40 +++++--
 41 files changed, 1186 insertions(+), 340 deletions(-)
 create mode 100644 Documentation/vm/pte_ref.rst
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

Comments

Jason Gunthorpe Nov. 10, 2021, 12:56 p.m. UTC | #1
On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:

> In this patch series, we add a pte_refcount field to the struct page of page
> table to track how many users of PTE page table. Similar to the mechanism of
> page refcount, the user of PTE page table should hold a refcount to it before
> accessing. The PTE page table page will be freed when the last refcount is
> dropped.

So, this approach basically adds two atomics on every PTE map

If I have it right the reason that zap cannot clean the PTEs today is
because zap cannot obtain the mmap lock due to a lock ordering issue
with the inode lock vs mmap lock.

If it could obtain the mmap lock then it could do the zap using the
write side as unmapping a vma does.

Rather than adding a new "lock" to ever PTE I wonder if it would be
more efficient to break up the mmap lock and introduce a specific
rwsem for the page table itself, in addition to the PTL. Currently the
mmap lock is protecting both the vma list and the page table.

I think that would allow the lock ordering issue to be resolved and
zap could obtain a page table rwsem.

Compared to two atomics per PTE this would just be two atomic per
page table walk operation, it is conceptually a lot simpler, and would
allow freeing all the page table levels, not just PTEs.

?

Jason
David Hildenbrand Nov. 10, 2021, 1:25 p.m. UTC | #2
On 10.11.21 13:56, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> 
>> In this patch series, we add a pte_refcount field to the struct page of page
>> table to track how many users of PTE page table. Similar to the mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The PTE page table page will be freed when the last refcount is
>> dropped.
> 
> So, this approach basically adds two atomics on every PTE map
> 
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

There are different ways to zap: madvise(DONTNEED) vs
fallocate(PUNCH_HOLE). It depends on "from where" we're actually
comming: a process page table walker or the rmap.

The way locking currently works doesn't allow to remove a page table
just by holding the mmap lock, not even in write mode. You'll also need
to hold the respective rmap locks -- which implies that reclaiming apge
tables crossing VMAs is "problematic". Take a look at khugepaged which
has to play quite some tricks to remove a page table.

And there are other ways we can create empty page tables via the rmap,
like reclaim/writeback, although they are rather a secondary concern mostly.

> 
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.
> 
> Rather than adding a new "lock" to ever PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

There is the rmap side of things as well. At least the rmap won't
reclaim alloc/free page tables, but it will walk page tables while
holding the respective rmap lock.

> 
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
> 
> Compared to two atomics per PTE this would just be two atomic per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

Another alternative is to not do it in the kernel automatically, but
instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get
called by user space explicitly once it's reasonable. While this will
work for the obvious madvise(DONTNEED) users -- like memory allocators
-- that zap memory, it's a bit more complicated once shared memory is
involved and we're fallocate(PUNCH_HOLE) memory. But it would at least
work for many use cases that want to optimize memory consumption for
sparse memory mappings.

Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area
will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs
are most certainly the most important part piece.
Qi Zheng Nov. 10, 2021, 1:54 p.m. UTC | #3
On 11/10/21 8:56 PM, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> 
>> In this patch series, we add a pte_refcount field to the struct page of page
>> table to track how many users of PTE page table. Similar to the mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The PTE page table page will be freed when the last refcount is
>> dropped.
> 
> So, this approach basically adds two atomics on every PTE map
> 
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

Currently, both MADV_DONTNEED and MADV_FREE obtain the read side of
mmap_lock instead of write side, which is the reason that 
jemalloc/tcmalloc prefer to use madvise() to release physical memory.

> 
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.

Even if it obtains the write side of mmap_lock, how to make sure that
all the page table entries are empty? Traverse 512 entries every time?

> 
> Rather than adding a new "lock" to ever PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

Now each level of page table has its own spin lock. Can you explain the
working mechanism of this special rwsem more clearly?

If we can reduce the protection range of mmap_lock, it is indeed a great
thing, but I think it is very difficult, and it will not solve the
problem of how to check that all entries in the page table page are
empty.

> 
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
> 
> Compared to two atomics per PTE this would just be two atomic per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

The reason why only the PTE page is released now is that it is the
largest. This reference count can actually be used for other levels of
page tables.

> 
> ?
> 
> Jason
>
Qi Zheng Nov. 10, 2021, 1:59 p.m. UTC | #4
On 11/10/21 9:25 PM, David Hildenbrand wrote:
> On 10.11.21 13:56, Jason Gunthorpe wrote:
>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>
>>> In this patch series, we add a pte_refcount field to the struct page of page
>>> table to track how many users of PTE page table. Similar to the mechanism of
>>> page refcount, the user of PTE page table should hold a refcount to it before
>>> accessing. The PTE page table page will be freed when the last refcount is
>>> dropped.
>>
>> So, this approach basically adds two atomics on every PTE map
>>
>> If I have it right the reason that zap cannot clean the PTEs today is
>> because zap cannot obtain the mmap lock due to a lock ordering issue
>> with the inode lock vs mmap lock.
> 
> There are different ways to zap: madvise(DONTNEED) vs
> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
> comming: a process page table walker or the rmap.
> 
> The way locking currently works doesn't allow to remove a page table
> just by holding the mmap lock, not even in write mode. You'll also need
> to hold the respective rmap locks -- which implies that reclaiming apge
> tables crossing VMAs is "problematic". Take a look at khugepaged which
> has to play quite some tricks to remove a page table.
> 
> And there are other ways we can create empty page tables via the rmap,
> like reclaim/writeback, although they are rather a secondary concern mostly.
> 
>>
>> If it could obtain the mmap lock then it could do the zap using the
>> write side as unmapping a vma does.
>>
>> Rather than adding a new "lock" to ever PTE I wonder if it would be
>> more efficient to break up the mmap lock and introduce a specific
>> rwsem for the page table itself, in addition to the PTL. Currently the
>> mmap lock is protecting both the vma list and the page table.
> 
> There is the rmap side of things as well. At least the rmap won't
> reclaim alloc/free page tables, but it will walk page tables while
> holding the respective rmap lock.
> 
>>
>> I think that would allow the lock ordering issue to be resolved and
>> zap could obtain a page table rwsem.
>>
>> Compared to two atomics per PTE this would just be two atomic per
>> page table walk operation, it is conceptually a lot simpler, and would
>> allow freeing all the page table levels, not just PTEs.
> 
> Another alternative is to not do it in the kernel automatically, but
> instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get
> called by user space explicitly once it's reasonable. While this will
> work for the obvious madvise(DONTNEED) users -- like memory allocators
> -- that zap memory, it's a bit more complicated once shared memory is
> involved and we're fallocate(PUNCH_HOLE) memory. But it would at least
> work for many use cases that want to optimize memory consumption for
> sparse memory mappings.
> 
> Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area
> will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs
> are most certainly the most important part piece.
> 

total agree!

Thanks,
Qi
Jason Gunthorpe Nov. 10, 2021, 2:38 p.m. UTC | #5
On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
> On 10.11.21 13:56, Jason Gunthorpe wrote:
> > On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> > 
> >> In this patch series, we add a pte_refcount field to the struct page of page
> >> table to track how many users of PTE page table. Similar to the mechanism of
> >> page refcount, the user of PTE page table should hold a refcount to it before
> >> accessing. The PTE page table page will be freed when the last refcount is
> >> dropped.
> > 
> > So, this approach basically adds two atomics on every PTE map
> > 
> > If I have it right the reason that zap cannot clean the PTEs today is
> > because zap cannot obtain the mmap lock due to a lock ordering issue
> > with the inode lock vs mmap lock.
> 
> There are different ways to zap: madvise(DONTNEED) vs
> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
> comming: a process page table walker or the rmap.

AFAIK rmap is the same issue, it can't lock the mmap_sem

> The way locking currently works doesn't allow to remove a page table
> just by holding the mmap lock, not even in write mode. 

I'm not sure I understand this. If the goal is to free the PTE tables
then the main concern is use-after free on page table walkers (which
this series is addressing). Ignoring bugs, we have only three ways to
read the page table:

 - Fully locked. Under the PTLs (gup slow is an example)
 - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
 - hw-locked. Barriered with TLB flush (gup fast is an example)

#1 should be completely safe as the PTLs will protect eveything
#2 is safe so long as the write side is held during any layout changes
#3 interacts with the TLB flush, and is also safe with zap

rmap itself is a #1 page table walker, ie it gets the PTLs under
page_vma_mapped_walk().

The sin we have comitted here is that both the mmap lock and the PTLs
are being used to protect the page table itself with a very
complicated dual semantic.

Splitting the sleeping mmap lock into 'covers vma' and 'covers page
tables' lets us solve the lock ordering and semi-locked can become
more fully locked by the new lock, instead of by abusing mmap sem.

I'd suggest to make this new lock a special rwsem which allows either
concurrent read access OR concurrent PTL access, but not both. This
way we don't degrade performance of the split PTLs, *and* when
something needs to change the page table structure it has a way to
properly exclude all the #2 lockless readers.

So evey touch to the page table starts by obtaining this new lock,
depending on the access mode to be used. (PTL vs lockless read)

We can keep the existing THP logic where a leaf PMD can be transformed
to a non-leaf PMD in the semi-locked case, but the case where a
non-leaf PMD is transformed to a leaf PMD has to take the lock.

Jason
David Hildenbrand Nov. 10, 2021, 3:37 p.m. UTC | #6
On 10.11.21 15:38, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
>> On 10.11.21 13:56, Jason Gunthorpe wrote:
>>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>>
>>>> In this patch series, we add a pte_refcount field to the struct page of page
>>>> table to track how many users of PTE page table. Similar to the mechanism of
>>>> page refcount, the user of PTE page table should hold a refcount to it before
>>>> accessing. The PTE page table page will be freed when the last refcount is
>>>> dropped.
>>>
>>> So, this approach basically adds two atomics on every PTE map
>>>
>>> If I have it right the reason that zap cannot clean the PTEs today is
>>> because zap cannot obtain the mmap lock due to a lock ordering issue
>>> with the inode lock vs mmap lock.
>>
>> There are different ways to zap: madvise(DONTNEED) vs
>> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
>> comming: a process page table walker or the rmap.
> 
> AFAIK rmap is the same issue, it can't lock the mmap_sem
> 
>> The way locking currently works doesn't allow to remove a page table
>> just by holding the mmap lock, not even in write mode. 
> 
> I'm not sure I understand this. If the goal is to free the PTE tables
> then the main concern is use-after free on page table walkers (which
> this series is addressing). Ignoring bugs, we have only three ways to
> read the page table:

Yes, use-after-free and reuse-while-freeing are the two challenges AFAIKs.

> 
>  - Fully locked. Under the PTLs (gup slow is an example)
>  - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
>  - hw-locked. Barriered with TLB flush (gup fast is an example)

Three additions as far as I can tell:

1. Fully locked currently needs the read mmap lock OR the rmap lock in
   read. PTLs on their own are not sufficient AFAIKT.
2. #1 and #2 can currently only walk within VMA ranges.
3. We can theoretically walk page tables outside VMA ranges with the
write mmap lock, because page tables get removed with the mmap lock in
read mode and heavy-weight operations (VMA layout, khugepaged) are
performed under the write mmap lock.

The rmap locks protect from modifications where we want to exclude rmap
walkers similarly to how we grab the mmap lock in write, where the PTLs
are not sufficient.

See mm/mremap.c:move_ptes() as an example which performs VMA layout +
page table modifications. See khugepagd which doesn't perform VMA layout
modifications but page table modifications.

> 
> #1 should be completely safe as the PTLs will protect eveything
> #2 is safe so long as the write side is held during any layout changes
> #3 interacts with the TLB flush, and is also safe with zap
> 
> rmap itself is a #1 page table walker, ie it gets the PTLs under
> page_vma_mapped_walk().

When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?

Because the PMD-PTLs re usually not taken in case we know there is a
page table (nothing would currently change it without heavy locking).
And if they are taken, they are only held while allocating/checking a
PMDE, not while actually *using* the page table that's mapped in that entry.

For example, walk_page_range() requires the mmap lock in read and grabs
the PTE-PTLs.

> 
> The sin we have comitted here is that both the mmap lock and the PTLs
> are being used to protect the page table itself with a very
> complicated dual semantic.
> 
> Splitting the sleeping mmap lock into 'covers vma' and 'covers page
> tables' lets us solve the lock ordering and semi-locked can become
> more fully locked by the new lock, instead of by abusing mmap sem.

It would still be a fairly coarse-grained locking, I am not sure if that
is a step into the right direction. If you want to modify *some* page
table in your process you have exclude each and every page table walker.
Or did I mis-interpret what you were saying?

> 
> I'd suggest to make this new lock a special rwsem which allows either
> concurrent read access OR concurrent PTL access, but not both. This

I looked into such a lock recently in similar context and something like
that does not exist yet (and fairness will be challenging). You either
have a single reader or multiple writer. I'd be interested if someone
knows of something like that.
Jason Gunthorpe Nov. 10, 2021, 4:39 p.m. UTC | #7
On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:

> >  - Fully locked. Under the PTLs (gup slow is an example)
> >  - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
> >  - hw-locked. Barriered with TLB flush (gup fast is an example)
> 
> Three additions as far as I can tell:
> 
> 1. Fully locked currently needs the read mmap lock OR the rmap lock in
>    read. PTLs on their own are not sufficient AFAIKT.

I think the reality is we don't have any fully locked walkers.. Even
gup slow is semi-lockless until it reaches lower levels, then it takes
the PTLs. (I forgot about that detail!)

The mmap lock is being used to protect the higher levels of the page
table. It is part of its complicated dual purpose.

> 2. #1 and #2 can currently only walk within VMA ranges.

AFICT this is an artifact of re-using the mmap lock to protect the
page table and not being able to obtain the mmap lock in all the
places the page table structure is manipulated.

> 3. We can theoretically walk page tables outside VMA ranges with the
> write mmap lock, because page tables get removed with the mmap lock in
> read mode and heavy-weight operations (VMA layout, khugepaged) are
> performed under the write mmap lock.

Yes, again, an artifact of the current locking.
 
> The rmap locks protect from modifications where we want to exclude rmap
> walkers similarly to how we grab the mmap lock in write, where the PTLs
> are not sufficient.

It is a similar dual purpose locking as the mmap sem :(

> > #1 should be completely safe as the PTLs will protect eveything
> > #2 is safe so long as the write side is held during any layout changes
> > #3 interacts with the TLB flush, and is also safe with zap
> > 
> > rmap itself is a #1 page table walker, ie it gets the PTLs under
> > page_vma_mapped_walk().
> 
> When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?

Ah, here I was thinking about a lock that can protect all the
levels. Today we are abusing the mmap lock to act as the pud_lock, for
instance.

> Because the PMD-PTLs re usually not taken in case we know there is a
> page table (nothing would currently change it without heavy
> locking).

This only works with the lockless walkers, and relies on the read mmap
sem/etc to also mean 'a PTE table cannot become a leaf PMD'

> For example, walk_page_range() requires the mmap lock in read and grabs
> the PTE-PTLs.

Yes, that is a semi-locked reader.

> It would still be a fairly coarse-grained locking, I am not sure if that
> is a step into the right direction. If you want to modify *some* page
> table in your process you have exclude each and every page table walker.
> Or did I mis-interpret what you were saying?

That is one possible design, it favours fast walking and penalizes
mutation. We could also stick a lock in the PMD (instead of a
refcount) and still logically be using a lock instead of a refcount
scheme. Remember modify here is "want to change a table pointer into a
leaf pointer" so it isn't an every day activity..

There is some advantage with this thinking because it harmonizes well
with the other stuff that wants to convert tables into leafs, but has
to deal with complicated locking.

On the other hand, refcounts are a degenerate kind of rwsem and only
help with freeing pages. It also puts more atomics in normal fast
paths since we are refcounting each PTE, not read locking the PMD.

Perhaps the ideal thing would be to stick a rwsem in the PMD. read
means a table cannot be come a leaf. I don't know if there is space
for another atomic in the PMD level, and we'd have to use a hitching
post/hashed waitq scheme too since there surely isn't room for a waitq
too..

I wouldn't be so quick to say one is better than the other, but at
least let's have thought about a locking solution before merging
refcounts :)

Jason
Matthew Wilcox (Oracle) Nov. 10, 2021, 4:49 p.m. UTC | #8
On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
> > I'd suggest to make this new lock a special rwsem which allows either
> > concurrent read access OR concurrent PTL access, but not both. This
> 
> I looked into such a lock recently in similar context and something like
> that does not exist yet (and fairness will be challenging). You either
> have a single reader or multiple writer. I'd be interested if someone
> knows of something like that.

We've talked about having such a lock before for filesystems which want
to permit either many direct-IO accesses or many buffered-IO accesses, but
want to exclude a mixture of direct-IO and buffered-IO.  The name we came
up with for such a lock was the red-blue lock.  Either Team Red has the
lock, or Team Blue has the lock (or it's free).  Indicate free with velue
zero, Team Red with positive numbers and Team Blue with negative numbers.
If we need to indicate an opposing team is waiting for the semaphore,
we can use a high bit (1 << 30) to indicate no new team members can
acquire the lock.  Not sure whether anybody ever coded it up.
David Hildenbrand Nov. 10, 2021, 4:53 p.m. UTC | #9
On 10.11.21 17:49, Matthew Wilcox wrote:
> On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
>>> I'd suggest to make this new lock a special rwsem which allows either
>>> concurrent read access OR concurrent PTL access, but not both. This
>>
>> I looked into such a lock recently in similar context and something like
>> that does not exist yet (and fairness will be challenging). You either
>> have a single reader or multiple writer. I'd be interested if someone
>> knows of something like that.
> 
> We've talked about having such a lock before for filesystems which want
> to permit either many direct-IO accesses or many buffered-IO accesses, but
> want to exclude a mixture of direct-IO and buffered-IO.  The name we came
> up with for such a lock was the red-blue lock.  Either Team Red has the
> lock, or Team Blue has the lock (or it's free).  Indicate free with velue
> zero, Team Red with positive numbers and Team Blue with negative numbers.
> If we need to indicate an opposing team is waiting for the semaphore,
> we can use a high bit (1 << 30) to indicate no new team members can
> acquire the lock.  Not sure whether anybody ever coded it up.

Interesting, thanks for sharing!

My excessive google search didn't reveal anything back then (~3 months
ago) -- only questions on popular coding websites asking essentially for
the same thing without any helpful replies. But maybe I used the wrong
keywords (e.g., "multiple reader, multiple writer", I certainly didn't
search for "red-blue lock", but I do like the name :) ).

Fairness might still be the biggest issue, but I am certainly no locking
expert.
Jason Gunthorpe Nov. 10, 2021, 4:56 p.m. UTC | #10
On Wed, Nov 10, 2021 at 05:53:57PM +0100, David Hildenbrand wrote:
> On 10.11.21 17:49, Matthew Wilcox wrote:
> > On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
> >>> I'd suggest to make this new lock a special rwsem which allows either
> >>> concurrent read access OR concurrent PTL access, but not both. This
> >>
> >> I looked into such a lock recently in similar context and something like
> >> that does not exist yet (and fairness will be challenging). You either
> >> have a single reader or multiple writer. I'd be interested if someone
> >> knows of something like that.
> > 
> > We've talked about having such a lock before for filesystems which want
> > to permit either many direct-IO accesses or many buffered-IO accesses, but
> > want to exclude a mixture of direct-IO and buffered-IO.  The name we came
> > up with for such a lock was the red-blue lock.  Either Team Red has the
> > lock, or Team Blue has the lock (or it's free).  Indicate free with velue
> > zero, Team Red with positive numbers and Team Blue with negative numbers.
> > If we need to indicate an opposing team is waiting for the semaphore,
> > we can use a high bit (1 << 30) to indicate no new team members can
> > acquire the lock.  Not sure whether anybody ever coded it up.
> 
> Interesting, thanks for sharing!
> 
> My excessive google search didn't reveal anything back then (~3 months
> ago) -- only questions on popular coding websites asking essentially for
> the same thing without any helpful replies. But maybe I used the wrong
> keywords (e.g., "multiple reader, multiple writer", I certainly didn't
> search for "red-blue lock", but I do like the name :) ).
> 
> Fairness might still be the biggest issue, but I am certainly no locking
> expert.

Fairness could use the same basic logic as the write prefered to read
in the rwsem.

The atomic implementation would be with atomic_dec_unless_positive()
and atomic_inc_unless_negative(), without fairness it looks
straightfoward.

Jason
David Hildenbrand Nov. 10, 2021, 5:37 p.m. UTC | #11
>> It would still be a fairly coarse-grained locking, I am not sure if that
>> is a step into the right direction. If you want to modify *some* page
>> table in your process you have exclude each and every page table walker.
>> Or did I mis-interpret what you were saying?
> 
> That is one possible design, it favours fast walking and penalizes
> mutation. We could also stick a lock in the PMD (instead of a
> refcount) and still logically be using a lock instead of a refcount
> scheme. Remember modify here is "want to change a table pointer into a
> leaf pointer" so it isn't an every day activity..

It will be if we somewhat frequent when reclaim an empty PTE page table
as soon as it turns empty. This not only happens when zapping, but also
during writeback/swapping. So while writing back / swapping you might be
left with empty page tables to reclaim.

Of course, this is the current approach. Another approach that doesn't
require additional refcounts is scanning page tables for empty ones and
reclaiming them. This scanning can either be triggered manually from
user space or automatically from the kernel.

> 
> There is some advantage with this thinking because it harmonizes well
> with the other stuff that wants to convert tables into leafs, but has
> to deal with complicated locking.
> 
> On the other hand, refcounts are a degenerate kind of rwsem and only
> help with freeing pages. It also puts more atomics in normal fast
> paths since we are refcounting each PTE, not read locking the PMD.
> 
> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
> means a table cannot be come a leaf. I don't know if there is space
> for another atomic in the PMD level, and we'd have to use a hitching
> post/hashed waitq scheme too since there surely isn't room for a waitq
> too..
> 
> I wouldn't be so quick to say one is better than the other, but at
> least let's have thought about a locking solution before merging
> refcounts :)

Yes, absolutely. I can see the beauty in the current approach, because
it just reclaims "automatically" once possible -- page table empty and
nobody is walking it. The downside is that it doesn't always make sense
to reclaim an empty page table immediately once it turns empty.

Also, it adds complexity for something that is only a problem in some
corner cases -- sparse memory mappings, especially relevant for some
memory allocators after freeing a lot of memory or running VMs with
memory ballooning after inflating the balloon. Some of these use cases
might be good with just triggering page table reclaim manually from user
space.
Jason Gunthorpe Nov. 10, 2021, 5:49 p.m. UTC | #12
On Wed, Nov 10, 2021 at 06:37:46PM +0100, David Hildenbrand wrote:

> It will be if we somewhat frequent when reclaim an empty PTE page table
> as soon as it turns empty.

What do you think is the frequency of unlocked page table walks that
this would have to block on?

> Also, it adds complexity for something that is only a problem in some
> corner cases -- sparse memory mappings, especially relevant for some
> memory allocators after freeing a lot of memory or running VMs with
> memory ballooning after inflating the balloon. Some of these use cases
> might be good with just triggering page table reclaim manually from user
> space.

Right, this is why it would be nice if the complexity could address
more than one problem, like the existing complex locking around the
thp stuff..

Jason
Qi Zheng Nov. 11, 2021, 3:58 a.m. UTC | #13
On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>> is a step into the right direction. If you want to modify *some* page
>>> table in your process you have exclude each and every page table walker.
>>> Or did I mis-interpret what you were saying?
>>
>> That is one possible design, it favours fast walking and penalizes
>> mutation. We could also stick a lock in the PMD (instead of a
>> refcount) and still logically be using a lock instead of a refcount
>> scheme. Remember modify here is "want to change a table pointer into a
>> leaf pointer" so it isn't an every day activity..
> 
> It will be if we somewhat frequent when reclaim an empty PTE page table
> as soon as it turns empty. This not only happens when zapping, but also
> during writeback/swapping. So while writing back / swapping you might be
> left with empty page tables to reclaim.
> 
> Of course, this is the current approach. Another approach that doesn't
> require additional refcounts is scanning page tables for empty ones and
> reclaiming them. This scanning can either be triggered manually from
> user space or automatically from the kernel.

Whether it is introducing a special rwsem or scanning an empty page
table, there are two problems as follows:

	#1. When to trigger the scanning or releasing?
	#2. Every time to release a 4K page table page, 512 page table
	    entries need to be scanned.

For #1, if the scanning is triggered manually from user space, the
kernel is relatively passive, and the user does not fully know the best
timing to scan. If the scanning is triggered automatically from the
kernel, that is great. But the timing is not easy to confirm, is it
scanned and reclaimed every time zap or try_to_unmap?

For #2, refcount has advantages.

> 
>>
>> There is some advantage with this thinking because it harmonizes well
>> with the other stuff that wants to convert tables into leafs, but has
>> to deal with complicated locking.
>>
>> On the other hand, refcounts are a degenerate kind of rwsem and only
>> help with freeing pages. It also puts more atomics in normal fast
>> paths since we are refcounting each PTE, not read locking the PMD.
>>
>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>> means a table cannot be come a leaf. I don't know if there is space
>> for another atomic in the PMD level, and we'd have to use a hitching
>> post/hashed waitq scheme too since there surely isn't room for a waitq
>> too..
>>
>> I wouldn't be so quick to say one is better than the other, but at
>> least let's have thought about a locking solution before merging
>> refcounts :)
> 
> Yes, absolutely. I can see the beauty in the current approach, because
> it just reclaims "automatically" once possible -- page table empty and
> nobody is walking it. The downside is that it doesn't always make sense
> to reclaim an empty page table immediately once it turns empty.
> 
> Also, it adds complexity for something that is only a problem in some
> corner cases -- sparse memory mappings, especially relevant for some
> memory allocators after freeing a lot of memory or running VMs with
> memory ballooning after inflating the balloon. Some of these use cases
> might be good with just triggering page table reclaim manually from user
> space.
> 

Yes, this is indeed a problem. Perhaps some flags can be introduced so
that the release of page table pages can be delayed in some cases.
Similar to the lazyfree mechanism in MADV_FREE?

Thanks,
Qi
David Hildenbrand Nov. 11, 2021, 9:22 a.m. UTC | #14
On 11.11.21 04:58, Qi Zheng wrote:
> 
> 
> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>> is a step into the right direction. If you want to modify *some* page
>>>> table in your process you have exclude each and every page table walker.
>>>> Or did I mis-interpret what you were saying?
>>>
>>> That is one possible design, it favours fast walking and penalizes
>>> mutation. We could also stick a lock in the PMD (instead of a
>>> refcount) and still logically be using a lock instead of a refcount
>>> scheme. Remember modify here is "want to change a table pointer into a
>>> leaf pointer" so it isn't an every day activity..
>>
>> It will be if we somewhat frequent when reclaim an empty PTE page table
>> as soon as it turns empty. This not only happens when zapping, but also
>> during writeback/swapping. So while writing back / swapping you might be
>> left with empty page tables to reclaim.
>>
>> Of course, this is the current approach. Another approach that doesn't
>> require additional refcounts is scanning page tables for empty ones and
>> reclaiming them. This scanning can either be triggered manually from
>> user space or automatically from the kernel.
> 
> Whether it is introducing a special rwsem or scanning an empty page
> table, there are two problems as follows:
> 
> 	#1. When to trigger the scanning or releasing?

For example when reclaiming memory, when scanning page tables in
khugepaged, or triggered by user space (note that this is the approach I
originally looked into). But it certainly requires more locking thought
to avoid stopping essentially any page table walker.

> 	#2. Every time to release a 4K page table page, 512 page table
> 	    entries need to be scanned.

It would happen only when actually trigger reclaim of page tables
(again, someone has to trigger it), so it's barely an issue.

For example, khugepaged already scans the page tables either way.

> 
> For #1, if the scanning is triggered manually from user space, the
> kernel is relatively passive, and the user does not fully know the best
> timing to scan. If the scanning is triggered automatically from the
> kernel, that is great. But the timing is not easy to confirm, is it
> scanned and reclaimed every time zap or try_to_unmap?
> 
> For #2, refcount has advantages.
> 
>>
>>>
>>> There is some advantage with this thinking because it harmonizes well
>>> with the other stuff that wants to convert tables into leafs, but has
>>> to deal with complicated locking.
>>>
>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>> help with freeing pages. It also puts more atomics in normal fast
>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>
>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>> means a table cannot be come a leaf. I don't know if there is space
>>> for another atomic in the PMD level, and we'd have to use a hitching
>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>> too..
>>>
>>> I wouldn't be so quick to say one is better than the other, but at
>>> least let's have thought about a locking solution before merging
>>> refcounts :)
>>
>> Yes, absolutely. I can see the beauty in the current approach, because
>> it just reclaims "automatically" once possible -- page table empty and
>> nobody is walking it. The downside is that it doesn't always make sense
>> to reclaim an empty page table immediately once it turns empty.
>>
>> Also, it adds complexity for something that is only a problem in some
>> corner cases -- sparse memory mappings, especially relevant for some
>> memory allocators after freeing a lot of memory or running VMs with
>> memory ballooning after inflating the balloon. Some of these use cases
>> might be good with just triggering page table reclaim manually from user
>> space.
>>
> 
> Yes, this is indeed a problem. Perhaps some flags can be introduced so
> that the release of page table pages can be delayed in some cases.
> Similar to the lazyfree mechanism in MADV_FREE?

The issue AFAIU is that once your refcount hits 0 (no more references,
no more entries), the longer you wait with reclaim, the longer others
have to wait for populating a fresh page table because the "page table
to be reclaimed" is still stuck around. You'd have to keep the refcount
increased for a while, and only drop it after a while. But when? And
how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
Qi Zheng Nov. 11, 2021, 11:08 a.m. UTC | #15
On 11/11/21 5:22 PM, David Hildenbrand wrote:
> On 11.11.21 04:58, Qi Zheng wrote:
>>
>>
>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>> is a step into the right direction. If you want to modify *some* page
>>>>> table in your process you have exclude each and every page table walker.
>>>>> Or did I mis-interpret what you were saying?
>>>>
>>>> That is one possible design, it favours fast walking and penalizes
>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>> refcount) and still logically be using a lock instead of a refcount
>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>> leaf pointer" so it isn't an every day activity..
>>>
>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>> as soon as it turns empty. This not only happens when zapping, but also
>>> during writeback/swapping. So while writing back / swapping you might be
>>> left with empty page tables to reclaim.
>>>
>>> Of course, this is the current approach. Another approach that doesn't
>>> require additional refcounts is scanning page tables for empty ones and
>>> reclaiming them. This scanning can either be triggered manually from
>>> user space or automatically from the kernel.
>>
>> Whether it is introducing a special rwsem or scanning an empty page
>> table, there are two problems as follows:
>>
>> 	#1. When to trigger the scanning or releasing?
> 
> For example when reclaiming memory, when scanning page tables in
> khugepaged, or triggered by user space (note that this is the approach I
> originally looked into). But it certainly requires more locking thought
> to avoid stopping essentially any page table walker.
> 
>> 	#2. Every time to release a 4K page table page, 512 page table
>> 	    entries need to be scanned.
> 
> It would happen only when actually trigger reclaim of page tables
> (again, someone has to trigger it), so it's barely an issue.
> 
> For example, khugepaged already scans the page tables either way.
> 
>>
>> For #1, if the scanning is triggered manually from user space, the
>> kernel is relatively passive, and the user does not fully know the best
>> timing to scan. If the scanning is triggered automatically from the
>> kernel, that is great. But the timing is not easy to confirm, is it
>> scanned and reclaimed every time zap or try_to_unmap?
>>
>> For #2, refcount has advantages.
>>
>>>
>>>>
>>>> There is some advantage with this thinking because it harmonizes well
>>>> with the other stuff that wants to convert tables into leafs, but has
>>>> to deal with complicated locking.
>>>>
>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>> help with freeing pages. It also puts more atomics in normal fast
>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>
>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>> means a table cannot be come a leaf. I don't know if there is space
>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>> too..
>>>>
>>>> I wouldn't be so quick to say one is better than the other, but at
>>>> least let's have thought about a locking solution before merging
>>>> refcounts :)
>>>
>>> Yes, absolutely. I can see the beauty in the current approach, because
>>> it just reclaims "automatically" once possible -- page table empty and
>>> nobody is walking it. The downside is that it doesn't always make sense
>>> to reclaim an empty page table immediately once it turns empty.
>>>
>>> Also, it adds complexity for something that is only a problem in some
>>> corner cases -- sparse memory mappings, especially relevant for some
>>> memory allocators after freeing a lot of memory or running VMs with
>>> memory ballooning after inflating the balloon. Some of these use cases
>>> might be good with just triggering page table reclaim manually from user
>>> space.
>>>
>>
>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>> that the release of page table pages can be delayed in some cases.
>> Similar to the lazyfree mechanism in MADV_FREE?
> 
> The issue AFAIU is that once your refcount hits 0 (no more references,
> no more entries), the longer you wait with reclaim, the longer others
> have to wait for populating a fresh page table because the "page table
> to be reclaimed" is still stuck around. You'd have to keep the refcount
> increased for a while, and only drop it after a while. But when? And
> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
> 

For running VMs with memory ballooning after inflating the balloon, is
this a hot behavior? Even if it is, it is already facing the release and
reallocation of physical pages. The overhead after introducing
pte_refcount is that we need to release and re-allocate page table page.
But 2MB physical pages only corresponds to 4KiB of PTE page table page.
So maybe the overhead is not big.

In fact, the performance test shown on the cover letter is this case:

	test program: 
https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c

Thanks,
Qi

>
David Hildenbrand Nov. 11, 2021, 11:19 a.m. UTC | #16
On 11.11.21 12:08, Qi Zheng wrote:
> 
> 
> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>> On 11.11.21 04:58, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>> table in your process you have exclude each and every page table walker.
>>>>>> Or did I mis-interpret what you were saying?
>>>>>
>>>>> That is one possible design, it favours fast walking and penalizes
>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>> leaf pointer" so it isn't an every day activity..
>>>>
>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>> during writeback/swapping. So while writing back / swapping you might be
>>>> left with empty page tables to reclaim.
>>>>
>>>> Of course, this is the current approach. Another approach that doesn't
>>>> require additional refcounts is scanning page tables for empty ones and
>>>> reclaiming them. This scanning can either be triggered manually from
>>>> user space or automatically from the kernel.
>>>
>>> Whether it is introducing a special rwsem or scanning an empty page
>>> table, there are two problems as follows:
>>>
>>> 	#1. When to trigger the scanning or releasing?
>>
>> For example when reclaiming memory, when scanning page tables in
>> khugepaged, or triggered by user space (note that this is the approach I
>> originally looked into). But it certainly requires more locking thought
>> to avoid stopping essentially any page table walker.
>>
>>> 	#2. Every time to release a 4K page table page, 512 page table
>>> 	    entries need to be scanned.
>>
>> It would happen only when actually trigger reclaim of page tables
>> (again, someone has to trigger it), so it's barely an issue.
>>
>> For example, khugepaged already scans the page tables either way.
>>
>>>
>>> For #1, if the scanning is triggered manually from user space, the
>>> kernel is relatively passive, and the user does not fully know the best
>>> timing to scan. If the scanning is triggered automatically from the
>>> kernel, that is great. But the timing is not easy to confirm, is it
>>> scanned and reclaimed every time zap or try_to_unmap?
>>>
>>> For #2, refcount has advantages.
>>>
>>>>
>>>>>
>>>>> There is some advantage with this thinking because it harmonizes well
>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>> to deal with complicated locking.
>>>>>
>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>
>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>> too..
>>>>>
>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>> least let's have thought about a locking solution before merging
>>>>> refcounts :)
>>>>
>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>> it just reclaims "automatically" once possible -- page table empty and
>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>> to reclaim an empty page table immediately once it turns empty.
>>>>
>>>> Also, it adds complexity for something that is only a problem in some
>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>> memory allocators after freeing a lot of memory or running VMs with
>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>> might be good with just triggering page table reclaim manually from user
>>>> space.
>>>>
>>>
>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>> that the release of page table pages can be delayed in some cases.
>>> Similar to the lazyfree mechanism in MADV_FREE?
>>
>> The issue AFAIU is that once your refcount hits 0 (no more references,
>> no more entries), the longer you wait with reclaim, the longer others
>> have to wait for populating a fresh page table because the "page table
>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>> increased for a while, and only drop it after a while. But when? And
>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>
> 
> For running VMs with memory ballooning after inflating the balloon, is
> this a hot behavior? Even if it is, it is already facing the release and
> reallocation of physical pages. The overhead after introducing
> pte_refcount is that we need to release and re-allocate page table page.
> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
> So maybe the overhead is not big.

The cases that come to my mind are

a) Swapping on shared memory with concurrent access
b) Reclaim on file-backed memory with concurrent access
c) Free page reporting as implemented by virtio-balloon

In all of these cases, you can have someone immediately re-access the
page table and re-populate it.

For something mostly static (balloon inflation, memory allocator), it's
not that big of a deal I guess.
Qi Zheng Nov. 11, 2021, noon UTC | #17
On 11/11/21 7:19 PM, David Hildenbrand wrote:
> On 11.11.21 12:08, Qi Zheng wrote:
>>
>>
>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>
>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>
>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>> left with empty page tables to reclaim.
>>>>>
>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>> user space or automatically from the kernel.
>>>>
>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>> table, there are two problems as follows:
>>>>
>>>> 	#1. When to trigger the scanning or releasing?
>>>
>>> For example when reclaiming memory, when scanning page tables in
>>> khugepaged, or triggered by user space (note that this is the approach I
>>> originally looked into). But it certainly requires more locking thought
>>> to avoid stopping essentially any page table walker.
>>>
>>>> 	#2. Every time to release a 4K page table page, 512 page table
>>>> 	    entries need to be scanned.
>>>
>>> It would happen only when actually trigger reclaim of page tables
>>> (again, someone has to trigger it), so it's barely an issue.
>>>
>>> For example, khugepaged already scans the page tables either way.
>>>
>>>>
>>>> For #1, if the scanning is triggered manually from user space, the
>>>> kernel is relatively passive, and the user does not fully know the best
>>>> timing to scan. If the scanning is triggered automatically from the
>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>
>>>> For #2, refcount has advantages.
>>>>
>>>>>
>>>>>>
>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>> to deal with complicated locking.
>>>>>>
>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>
>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>> too..
>>>>>>
>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>> least let's have thought about a locking solution before merging
>>>>>> refcounts :)
>>>>>
>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>
>>>>> Also, it adds complexity for something that is only a problem in some
>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>> might be good with just triggering page table reclaim manually from user
>>>>> space.
>>>>>
>>>>
>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>> that the release of page table pages can be delayed in some cases.
>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>
>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>> no more entries), the longer you wait with reclaim, the longer others
>>> have to wait for populating a fresh page table because the "page table
>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>> increased for a while, and only drop it after a while. But when? And
>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>
>>
>> For running VMs with memory ballooning after inflating the balloon, is
>> this a hot behavior? Even if it is, it is already facing the release and
>> reallocation of physical pages. The overhead after introducing
>> pte_refcount is that we need to release and re-allocate page table page.
>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>> So maybe the overhead is not big.
> 
> The cases that come to my mind are
> 
> a) Swapping on shared memory with concurrent access
> b) Reclaim on file-backed memory with concurrent access
> c) Free page reporting as implemented by virtio-balloon
> 
> In all of these cases, you can have someone immediately re-access the
> page table and re-populate it.

In the performance test shown on the cover, we repeatedly performed
touch and madvise(MADV_DONTNEED) actions, which simulated the case
you said above.

We did find a small amount of performance regression, but I think it is
acceptable, and no new perf hotspots have been added.

> 
> For something mostly static (balloon inflation, memory allocator), it's
> not that big of a deal I guess.
>
David Hildenbrand Nov. 11, 2021, 12:20 p.m. UTC | #18
On 11.11.21 13:00, Qi Zheng wrote:
> 
> 
> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>> On 11.11.21 12:08, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>
>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>
>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>> left with empty page tables to reclaim.
>>>>>>
>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>> user space or automatically from the kernel.
>>>>>
>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>> table, there are two problems as follows:
>>>>>
>>>>> 	#1. When to trigger the scanning or releasing?
>>>>
>>>> For example when reclaiming memory, when scanning page tables in
>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>> originally looked into). But it certainly requires more locking thought
>>>> to avoid stopping essentially any page table walker.
>>>>
>>>>> 	#2. Every time to release a 4K page table page, 512 page table
>>>>> 	    entries need to be scanned.
>>>>
>>>> It would happen only when actually trigger reclaim of page tables
>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>
>>>> For example, khugepaged already scans the page tables either way.
>>>>
>>>>>
>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>
>>>>> For #2, refcount has advantages.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>> to deal with complicated locking.
>>>>>>>
>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>
>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>> too..
>>>>>>>
>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>> least let's have thought about a locking solution before merging
>>>>>>> refcounts :)
>>>>>>
>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>
>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>> space.
>>>>>>
>>>>>
>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>> that the release of page table pages can be delayed in some cases.
>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>
>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>> no more entries), the longer you wait with reclaim, the longer others
>>>> have to wait for populating a fresh page table because the "page table
>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>> increased for a while, and only drop it after a while. But when? And
>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>
>>>
>>> For running VMs with memory ballooning after inflating the balloon, is
>>> this a hot behavior? Even if it is, it is already facing the release and
>>> reallocation of physical pages. The overhead after introducing
>>> pte_refcount is that we need to release and re-allocate page table page.
>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>> So maybe the overhead is not big.
>>
>> The cases that come to my mind are
>>
>> a) Swapping on shared memory with concurrent access
>> b) Reclaim on file-backed memory with concurrent access
>> c) Free page reporting as implemented by virtio-balloon
>>
>> In all of these cases, you can have someone immediately re-access the
>> page table and re-populate it.
> 
> In the performance test shown on the cover, we repeatedly performed
> touch and madvise(MADV_DONTNEED) actions, which simulated the case
> you said above.
> 
> We did find a small amount of performance regression, but I think it is
> acceptable, and no new perf hotspots have been added.

That test always accesses 2MiB and does it from a single thread. Things
might (IMHO will) look different when only accessing individual pages
and doing the access from one/multiple separate threads (that's what
a),b) and c) essentially do, they don't do it in the pattern you
measured. what you measured matches rather a typical memory allocator).
Qi Zheng Nov. 11, 2021, 12:32 p.m. UTC | #19
On 11/11/21 8:20 PM, David Hildenbrand wrote:
> On 11.11.21 13:00, Qi Zheng wrote:
>>
>>
>> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>>> On 11.11.21 12:08, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>>
>>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>>
>>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>>> left with empty page tables to reclaim.
>>>>>>>
>>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>>> user space or automatically from the kernel.
>>>>>>
>>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>>> table, there are two problems as follows:
>>>>>>
>>>>>> 	#1. When to trigger the scanning or releasing?
>>>>>
>>>>> For example when reclaiming memory, when scanning page tables in
>>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>>> originally looked into). But it certainly requires more locking thought
>>>>> to avoid stopping essentially any page table walker.
>>>>>
>>>>>> 	#2. Every time to release a 4K page table page, 512 page table
>>>>>> 	    entries need to be scanned.
>>>>>
>>>>> It would happen only when actually trigger reclaim of page tables
>>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>>
>>>>> For example, khugepaged already scans the page tables either way.
>>>>>
>>>>>>
>>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>>
>>>>>> For #2, refcount has advantages.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>>> to deal with complicated locking.
>>>>>>>>
>>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>>
>>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>>> too..
>>>>>>>>
>>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>>> least let's have thought about a locking solution before merging
>>>>>>>> refcounts :)
>>>>>>>
>>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>>
>>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>>> space.
>>>>>>>
>>>>>>
>>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>>> that the release of page table pages can be delayed in some cases.
>>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>>
>>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>>> no more entries), the longer you wait with reclaim, the longer others
>>>>> have to wait for populating a fresh page table because the "page table
>>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>>> increased for a while, and only drop it after a while. But when? And
>>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>>
>>>>
>>>> For running VMs with memory ballooning after inflating the balloon, is
>>>> this a hot behavior? Even if it is, it is already facing the release and
>>>> reallocation of physical pages. The overhead after introducing
>>>> pte_refcount is that we need to release and re-allocate page table page.
>>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>>> So maybe the overhead is not big.
>>>
>>> The cases that come to my mind are
>>>
>>> a) Swapping on shared memory with concurrent access
>>> b) Reclaim on file-backed memory with concurrent access
>>> c) Free page reporting as implemented by virtio-balloon
>>>
>>> In all of these cases, you can have someone immediately re-access the
>>> page table and re-populate it.
>>
>> In the performance test shown on the cover, we repeatedly performed
>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>> you said above.
>>
>> We did find a small amount of performance regression, but I think it is
>> acceptable, and no new perf hotspots have been added.
> 
> That test always accesses 2MiB and does it from a single thread. Things
> might (IMHO will) look different when only accessing individual pages
> and doing the access from one/multiple separate threads (that's what

No, it includes multi-threading:

	while (1) {
		char *c;
		char *start = mmap_area[cpu];
		char *end = mmap_area[cpu] + FAULT_LENGTH;
		pthread_barrier_wait(&barrier);
		//printf("fault into %p-%p\n",start, end);

		for (c = start; c < end; c += PAGE_SIZE)
			*c = 0;

		pthread_barrier_wait(&barrier);
		for (i = 0; cpu==0 && i < num; i++)
			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
		pthread_barrier_wait(&barrier);
	}

Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
memory of threads on other cpu.

> a),b) and c) essentially do, they don't do it in the pattern you
> measured. what you measured matches rather a typical memory allocator).
> 
>
David Hildenbrand Nov. 11, 2021, 12:51 p.m. UTC | #20
On 11.11.21 13:32, Qi Zheng wrote:
> 
> 
> On 11/11/21 8:20 PM, David Hildenbrand wrote:
>> On 11.11.21 13:00, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>>>> On 11.11.21 12:08, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>>>
>>>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>>>
>>>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>>>> left with empty page tables to reclaim.
>>>>>>>>
>>>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>>>> user space or automatically from the kernel.
>>>>>>>
>>>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>>>> table, there are two problems as follows:
>>>>>>>
>>>>>>> 	#1. When to trigger the scanning or releasing?
>>>>>>
>>>>>> For example when reclaiming memory, when scanning page tables in
>>>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>>>> originally looked into). But it certainly requires more locking thought
>>>>>> to avoid stopping essentially any page table walker.
>>>>>>
>>>>>>> 	#2. Every time to release a 4K page table page, 512 page table
>>>>>>> 	    entries need to be scanned.
>>>>>>
>>>>>> It would happen only when actually trigger reclaim of page tables
>>>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>>>
>>>>>> For example, khugepaged already scans the page tables either way.
>>>>>>
>>>>>>>
>>>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>>>
>>>>>>> For #2, refcount has advantages.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>>>> to deal with complicated locking.
>>>>>>>>>
>>>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>>>
>>>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>>>> too..
>>>>>>>>>
>>>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>>>> least let's have thought about a locking solution before merging
>>>>>>>>> refcounts :)
>>>>>>>>
>>>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>>>
>>>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>>>> space.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>>>> that the release of page table pages can be delayed in some cases.
>>>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>>>
>>>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>>>> no more entries), the longer you wait with reclaim, the longer others
>>>>>> have to wait for populating a fresh page table because the "page table
>>>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>>>> increased for a while, and only drop it after a while. But when? And
>>>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>>>
>>>>>
>>>>> For running VMs with memory ballooning after inflating the balloon, is
>>>>> this a hot behavior? Even if it is, it is already facing the release and
>>>>> reallocation of physical pages. The overhead after introducing
>>>>> pte_refcount is that we need to release and re-allocate page table page.
>>>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>>>> So maybe the overhead is not big.
>>>>
>>>> The cases that come to my mind are
>>>>
>>>> a) Swapping on shared memory with concurrent access
>>>> b) Reclaim on file-backed memory with concurrent access
>>>> c) Free page reporting as implemented by virtio-balloon
>>>>
>>>> In all of these cases, you can have someone immediately re-access the
>>>> page table and re-populate it.
>>>
>>> In the performance test shown on the cover, we repeatedly performed
>>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>>> you said above.
>>>
>>> We did find a small amount of performance regression, but I think it is
>>> acceptable, and no new perf hotspots have been added.
>>
>> That test always accesses 2MiB and does it from a single thread. Things
>> might (IMHO will) look different when only accessing individual pages
>> and doing the access from one/multiple separate threads (that's what
> 
> No, it includes multi-threading:
> 

Oh sorry, I totally skipped [2].

> 	while (1) {
> 		char *c;
> 		char *start = mmap_area[cpu];
> 		char *end = mmap_area[cpu] + FAULT_LENGTH;
> 		pthread_barrier_wait(&barrier);
> 		//printf("fault into %p-%p\n",start, end);
> 
> 		for (c = start; c < end; c += PAGE_SIZE)
> 			*c = 0;
> 
> 		pthread_barrier_wait(&barrier);
> 		for (i = 0; cpu==0 && i < num; i++)
> 			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
> 		pthread_barrier_wait(&barrier);
> 	}
> 
> Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
> memory of threads on other cpu.
> 

I'll have a more detailed look at the benchmark. On a quick glimpse,
looks like the threads are also accessing a full 2MiB range, one page at
a time, and one thread is zapping the whole 2MiB range. A single CPU
only accesses memory within one 2MiB range IIRC.

Having multiple threads just access individual pages within a single 2
MiB region, and having one thread zap that memory (e.g., simulate
swapout) could be another benchmark.

We have to make sure to run with THP disabled (e.g., using
madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
eventually), because otherwise you might just be populating+zapping THPs
if they would otherwise be allowed in the environment.
Qi Zheng Nov. 11, 2021, 1:01 p.m. UTC | #21
On 11/11/21 8:51 PM, David Hildenbrand wrote:

>>>>
>>>> In the performance test shown on the cover, we repeatedly performed
>>>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>>>> you said above.
>>>>
>>>> We did find a small amount of performance regression, but I think it is
>>>> acceptable, and no new perf hotspots have been added.
>>>
>>> That test always accesses 2MiB and does it from a single thread. Things
>>> might (IMHO will) look different when only accessing individual pages
>>> and doing the access from one/multiple separate threads (that's what
>>
>> No, it includes multi-threading:
>>
> 
> Oh sorry, I totally skipped [2].
> 
>> 	while (1) {
>> 		char *c;
>> 		char *start = mmap_area[cpu];
>> 		char *end = mmap_area[cpu] + FAULT_LENGTH;
>> 		pthread_barrier_wait(&barrier);
>> 		//printf("fault into %p-%p\n",start, end);
>>
>> 		for (c = start; c < end; c += PAGE_SIZE)
>> 			*c = 0;
>>
>> 		pthread_barrier_wait(&barrier);
>> 		for (i = 0; cpu==0 && i < num; i++)
>> 			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
>> 		pthread_barrier_wait(&barrier);
>> 	}
>>
>> Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
>> memory of threads on other cpu.
>>
> 
> I'll have a more detailed look at the benchmark. On a quick glimpse,

Thank you for your time :)

> looks like the threads are also accessing a full 2MiB range, one page at
> a time, and one thread is zapping the whole 2MiB range. A single CPU
> only accesses memory within one 2MiB range IIRC.
> 
> Having multiple threads just access individual pages within a single 2
> MiB region, and having one thread zap that memory (e.g., simulate
> swapout) could be another benchmark.

LGTM, I will simulate more scenarios for testing.

> 
> We have to make sure to run with THP disabled (e.g., using
> madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
> eventually), because otherwise you might just be populating+zapping THPs
> if they would otherwise be allowed in the environment.

Yes, I turned off THP during testing:

root@~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

>