[v5,0/2] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

Message ID 20240408042437.10951-1-ioworker0@gmail.com (mailing list archive)

Message

Lance Yang April 8, 2024, 4:24 a.m. UTC
Hi All,

This patchset adds support for lazyfreeing multi-size THP (mTHP) without
needing to first split the large folio via split_folio(). However, we
still need to split a large folio that is not fully mapped within the
target range.

If a large folio is locked or shared, or if we fail to split it, we just
leave it in place and advance to the next PTE in the range. But note that
the behavior is changed; previously, any failure of this sort would cause
the entire operation to give up. As large folios become more common,
sticking to the old way could result in wasted opportunities.

Performance Testing
===================

On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of
the same size results in the following runtimes for madvise(MADV_FREE)
in seconds (shorter is better):

Folio Size |   Old    |   New    | Change
------------------------------------------
      4KiB | 0.590251 | 0.590259 |    0%
     16KiB | 2.990447 | 0.185655 |  -94%
     32KiB | 2.547831 | 0.104870 |  -95%
     64KiB | 2.457796 | 0.052812 |  -97%
    128KiB | 2.281034 | 0.032777 |  -99%
    256KiB | 2.230387 | 0.017496 |  -99%
    512KiB | 2.189106 | 0.010781 |  -99%
   1024KiB | 2.183949 | 0.007753 |  -99%
   2048KiB | 0.002799 | 0.002804 |    0%

---
This patchset applies against mm-unstable (f43b3aae9451). 

The performance numbers are from v2. I did a quick benchmark run of v5 and
nothing significantly changed.

Changes since v4 [4]
====================
 - The first patch implements the MADV_FREE change and introduces
   mkold_clean_ptes() with a generic implementation. The second patch
   specializes mkold_clean_ptes() for arm64, providing a performance boost
   specific to arm64 (per Ryan Roberts)
 - Drop the full parameter and call ptep_get_and_clear() in mkold_clean_ptes()
   (per Ryan Roberts)
 - Keep the previous behavior that avoids locking the folio if it wasn't in the
   swapcache or if it wasn't dirty (per Ryan Roberts)

Changes since v3 [3]
====================
 - Rename refresh_full_ptes -> mkold_clean_ptes (per Ryan Roberts)
 - Override mkold_clean_ptes() for arm64 to make it faster (per Ryan Roberts)
 - Update the changelog

Changes since v2 [2]
====================
 - Only skip all the PTEs for nr_pages when the number of batched PTEs matches
   nr_pages (per Barry Song)
 - Change folio_pte_batch() to consume an optional *any_dirty and *any_young
   function (per David Hildenbrand)
 - Move the ptep_get_and_clear_full() loop into refresh_full_ptes() (per
   David Hildenbrand)
 - Follow a similar pattern for madvise_free_pte_range() (per Ryan Roberts)

Changes since v1 [1]
====================
 - Update the performance numbers
 - Update the changelog (per Ryan Roberts)
 - Check the COW folio (per Yin Fengwei)
 - Check if we are mapping all subpages (per Barry Song, David Hildenbrand,
   Ryan Roberts)

[1] https://lore.kernel.org/linux-mm/20240225123215.86503-1-ioworker0@gmail.com
[2] https://lore.kernel.org/linux-mm/20240307061425.21013-1-ioworker0@gmail.com
[3] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com
[4] https://lore.kernel.org/linux-mm/20240402124029.47846-1-ioworker0@gmail.com

Thanks,
Lance

Lance Yang (2):
 mm/madvise: optimize lazyfreeing with mTHP in madvise_free
 mm/arm64: override mkold_clean_ptes() batch helper

 arch/arm64/include/asm/pgtable.h |  57 +++++++++++++++++++++++++++++++++
 arch/arm64/mm/contpte.c          |  15 +++++++++
 include/linux/pgtable.h          |  35 ++++++++++++++++++++
 mm/internal.h                    |  12 +++++--
 mm/madvise.c                     | 149 +++++++++++++++++++++++++++++++++++----
 mm/memory.c                      |   4 +--
 6 files changed, 202 insertions(+), 70 deletions(-)

Comments

Andrew Morton April 10, 2024, 9:50 p.m. UTC | #1
On Mon,  8 Apr 2024 12:24:35 +0800 Lance Yang <ioworker0@gmail.com> wrote:

> Hi All,
> 
> This patchset adds support for lazyfreeing multi-size THP (mTHP) without
> needing to first split the large folio via split_folio(). However, we
> still need to split a large folio that is not fully mapped within the
> target range.
> 
> If a large folio is locked or shared, or if we fail to split it, we just
> leave it in place and advance to the next PTE in the range. But note that
> the behavior is changed; previously, any failure of this sort would cause
> the entire operation to give up. As large folios become more common,
> sticking to the old way could result in wasted opportunities.
> 
> Performance Testing
> ===================
> 
> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of
> the same size results in the following runtimes for madvise(MADV_FREE)
> in seconds (shorter is better):
> 
> Folio Size |   Old    |   New    | Change
> ------------------------------------------
>       4KiB | 0.590251 | 0.590259 |    0%
>      16KiB | 2.990447 | 0.185655 |  -94%
>      32KiB | 2.547831 | 0.104870 |  -95%
>      64KiB | 2.457796 | 0.052812 |  -97%
>     128KiB | 2.281034 | 0.032777 |  -99%
>     256KiB | 2.230387 | 0.017496 |  -99%
>     512KiB | 2.189106 | 0.010781 |  -99%
>    1024KiB | 2.183949 | 0.007753 |  -99%
>    2048KiB | 0.002799 | 0.002804 |    0%

That looks nice but punting work to another thread can slightly
increase overall system load and can mess up utilization accounting by
attributing work to threads which didn't initiate that work.

And there's a corner-case risk where the thread running madvise() has
realtime policy (SCHED_RR/SCHED_FIFO) on a single-CPU system,
preventing any other threads from executing, resulting in indefinitely
deferred freeing resulting in memory squeezes or even OOM conditions.

It would be good if the changelog(s) were to show some consideration of
such matters and some demonstration that the benefits exceed the risks
and costs.
Lance Yang April 11, 2024, 5:01 a.m. UTC | #2
On Thu, Apr 11, 2024 at 5:50 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon,  8 Apr 2024 12:24:35 +0800 Lance Yang <ioworker0@gmail.com> wrote:
>
> > Hi All,
> >
> > This patchset adds support for lazyfreeing multi-size THP (mTHP) without
> > needing to first split the large folio via split_folio(). However, we
> > still need to split a large folio that is not fully mapped within the
> > target range.
> >
> > If a large folio is locked or shared, or if we fail to split it, we just
> > leave it in place and advance to the next PTE in the range. But note that
> > the behavior is changed; previously, any failure of this sort would cause
> > the entire operation to give up. As large folios become more common,
> > sticking to the old way could result in wasted opportunities.
> >
> > Performance Testing
> > ===================
> >
> > On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of
> > the same size results in the following runtimes for madvise(MADV_FREE)
> > in seconds (shorter is better):
> >
> > Folio Size |   Old    |   New    | Change
> > ------------------------------------------
> >       4KiB | 0.590251 | 0.590259 |    0%
> >      16KiB | 2.990447 | 0.185655 |  -94%
> >      32KiB | 2.547831 | 0.104870 |  -95%
> >      64KiB | 2.457796 | 0.052812 |  -97%
> >     128KiB | 2.281034 | 0.032777 |  -99%
> >     256KiB | 2.230387 | 0.017496 |  -99%
> >     512KiB | 2.189106 | 0.010781 |  -99%
> >    1024KiB | 2.183949 | 0.007753 |  -99%
> >    2048KiB | 0.002799 | 0.002804 |    0%
>
> That looks nice but punting work to another thread can slightly
> increase overall system load and can mess up utilization accounting by
> attributing work to threads which didn't initiate that work.
>
> And there's a corner-case risk where the thread running madvise() has
> realtime policy (SCHED_RR/SCHED_FIFO) on a single-CPU system,
> preventing any other threads from executing, resulting in indefinitely
> deferred freeing resulting in memory squeezes or even OOM conditions.
>
> It would be good if the changelog(s) were to show some consideration of
> such matters and some demonstration that the benefits exceed the risks
> and costs.
>

Hey Andrew,

Thanks for bringing up these concerns!

I completely agree that we need to consider such matters and include
them in the changelog(s). Additionally, I'll do my best to show that the
benefits exceed the risks and costs, and then update the changelog(s)
accordingly.

Thanks again for your time!
Lance
Ryan Roberts April 11, 2024, 10:29 a.m. UTC | #3
On 11/04/2024 06:01, Lance Yang wrote:
> On Thu, Apr 11, 2024 at 5:50 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Mon,  8 Apr 2024 12:24:35 +0800 Lance Yang <ioworker0@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> This patchset adds support for lazyfreeing multi-size THP (mTHP) without
>>> needing to first split the large folio via split_folio(). However, we
>>> still need to split a large folio that is not fully mapped within the
>>> target range.
>>>
>>> If a large folio is locked or shared, or if we fail to split it, we just
>>> leave it in place and advance to the next PTE in the range. But note that
>>> the behavior is changed; previously, any failure of this sort would cause
>>> the entire operation to give up. As large folios become more common,
>>> sticking to the old way could result in wasted opportunities.
>>>
>>> Performance Testing
>>> ===================
>>>
>>> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of
>>> the same size results in the following runtimes for madvise(MADV_FREE)
>>> in seconds (shorter is better):
>>>
>>> Folio Size |   Old    |   New    | Change
>>> ------------------------------------------
>>>       4KiB | 0.590251 | 0.590259 |    0%
>>>      16KiB | 2.990447 | 0.185655 |  -94%
>>>      32KiB | 2.547831 | 0.104870 |  -95%
>>>      64KiB | 2.457796 | 0.052812 |  -97%
>>>     128KiB | 2.281034 | 0.032777 |  -99%
>>>     256KiB | 2.230387 | 0.017496 |  -99%
>>>     512KiB | 2.189106 | 0.010781 |  -99%
>>>    1024KiB | 2.183949 | 0.007753 |  -99%
>>>    2048KiB | 0.002799 | 0.002804 |    0%
>>
>> That looks nice but punting work to another thread can slightly
>> increase overall system load and can mess up utilization accounting by
>> attributing work to threads which didn't initiate that work.

My understanding is that this speedup is all coming from the avoidance of
splitting folios synchronously in the context of madvise(MADV_FREE). It's not
actually punting any more work to be done lazily, it's just avoiding doing
extra unnecessary work up front. In fact, it would result in less work to do at
lazyfree time because the folios remain large so there are fewer folios to free.

Perhaps I've misunderstood your point?

Thanks,
Ryan

>>
>> And there's a corner-case risk where the thread running madvise() has
>> realtime policy (SCHED_RR/SCHED_FIFO) on a single-CPU system,
>> preventing any other threads from executing, resulting in indefinitely
>> deferred freeing resulting in memory squeezes or even OOM conditions.
>>
>> It would be good if the changelog(s) were to show some consideration of
>> such matters and some demonstration that the benefits exceed the risks
>> and costs.
>>
> 
> Hey Andrew,
> 
> Thanks for bringing up these concerns!
> 
> I completely agree that we need to consider such matters and include
> them in the changelog(s). Additionally, I'll do my best to show that the
> benefits exceed the risks and costs, and then update the changelog(s)
> accordingly.
> 
> Thanks again for your time!
> Lance