Message ID: 20240618232648.4090299-1-ryan.roberts@arm.com
Series: Alternative mTHP swap allocator improvements
Hi, Ryan, Ryan Roberts <ryan.roberts@arm.com> writes: > Hi All, > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > entry allocator. I don't think the original behavior is something like mess. It's just the first step in the correct direction. It's straightforward and obviously correctly. Then, we can optimize it step by step with data to justify the increased complexity. > But Barry posted a test program and results at [2] showing that > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > some cases). I was interested in why that might be and ended up putting this PoC > patch set together to try to get a better understanding. This series ends up > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > much testing beyond that (yet) but thought it was worth posting on the strength > of that result alone. > > At a high level this works in a similar way to Chris's series; it marks a > cluster as being for a particular order and if a new cluster cannot be allocated > then it scans through the existing non-full clusters. But it does it by scanning > through the clusters rather than assembling them into a list. Cluster flags are > used to mark clusters that have been scanned and are known not to have enough > contiguous space, so the efficiency should be similar in practice. > > Because its not based around a linked list, there is less churn and I'm > wondering if this is perhaps easier to review and potentially even get into > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > for Chris's series? I know Chris has a larger roadmap of improvements, so at > best I see this as a tactical fix that will ultimately be superseeded by Chris's > work. I don't think we need any mTHP swap entry allocation optimization to go into v6.10-rcX. There's no functionality or performance regression. Per my understanding, we merge optimization when it's ready. Hi, Andrew, Please correct me if you don't agree. [snip] -- Best Regards, Huang, Ying
On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi All, > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > entry allocator. But Barry posted a test program and results at [2] showing that > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > some cases). I was interested in why that might be and ended up putting this PoC > patch set together to try to get a better understanding. This series ends up > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > much testing beyond that (yet) but thought it was worth posting on the strength > of that result alone. > > At a high level this works in a similar way to Chris's series; it marks a > cluster as being for a particular order and if a new cluster cannot be allocated > then it scans through the existing non-full clusters. But it does it by scanning > through the clusters rather than assembling them into a list. Cluster flags are > used to mark clusters that have been scanned and are known not to have enough > contiguous space, so the efficiency should be similar in practice. > > Because its not based around a linked list, there is less churn and I'm > wondering if this is perhaps easier to review and potentially even get into > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > for Chris's series? I know Chris has a larger roadmap of improvements, so at > best I see this as a tactical fix that will ultimately be superseeded by Chris's > work. > > There are a few differences to note vs Chris's series: > > - order-0 fallback scanning is still allowed in any cluster; the argument in the > past was that swap should always use all the swap space, so I've left this > mechanism in. It is only a fallback though; first the the new per-order > scanner is invoked, even for order-0, so if there are free slots in clusters > already assigned for order-0, then the allocation will go there. > > - CPUs can steal slots from other CPU's current clusters; those clusters remain > scannable while they are current for a CPU and are only made unscannable when > no more CPUs are scanning that particular cluster. > > - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > as I understand it, the original intent of a per-cpu current cluster was to > get pages for an application adjacent in the swap to speed up IO. > > I'd be keen to hear if you think we could get something like this into v6.10 to > fix the mess - I'm willing to work quickly to address comments and do more > testing. If not, then this is probably just a distraction and we should > concentrate on Chris's series. Ryan, thank you very much for accomplishing this. I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of add_to_swap() for both your approach and Chris's. I will update you with the results ASAP. I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has caused a couple of crashes. > > This applies on top of v6.10-rc4. 
> > [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/ > [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ > > Thanks, > Ryan > > Ryan Roberts (5): > mm: swap: Simplify end-of-cluster calculation > mm: swap: Change SWAP_NEXT_INVALID to highest value > mm: swap: Track allocation order for clusters > mm: swap: Scan for free swap entries in allocated clusters > mm: swap: Optimize per-order cluster scanning > > include/linux/swap.h | 18 +++-- > mm/swapfile.c | 164 ++++++++++++++++++++++++++++++++++++++----- > 2 files changed, 157 insertions(+), 25 deletions(-) > > -- > 2.43.0 > Thanks Barry
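The allocation scheme sketched in the cover letter above (assign each cluster to an order, prefer a brand-new cluster, otherwise scan the non-full clusters already assigned to that order and remember which ones have no aligned space left) can be modelled in a few lines of userspace C. The structures and names below (struct cluster, full_for_order, alloc_slots) are invented for the illustration; they are not the kernel's swap_cluster_info fields, flags or locking, and order-0 fallback scanning is left out.

/*
 * Userspace model of the per-order cluster scanning described above.
 * All names and structures here are simplified illustrations, not the
 * kernel's swap_cluster_info, its flags or its locking.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CLUSTER_SLOTS   512     /* slots per cluster, like SWAPFILE_CLUSTER */
#define NR_CLUSTERS     8
#define ORDER_FREE      -1      /* cluster not yet assigned to any order */

struct cluster {
        unsigned char map[CLUSTER_SLOTS];       /* 0 == slot free */
        int order;                              /* order this cluster serves */
        bool full_for_order;                    /* scanned: no aligned run left */
};

static struct cluster clusters[NR_CLUSTERS];

/* Look for a naturally aligned run of nr free slots inside one cluster. */
static int scan_cluster(struct cluster *c, unsigned int nr)
{
        unsigned int base, i;

        for (base = 0; base + nr <= CLUSTER_SLOTS; base += nr) {
                for (i = 0; i < nr; i++)
                        if (c->map[base + i])
                                break;
                if (i == nr)
                        return base;
        }
        return -1;
}

/*
 * Allocate 1 << order slots: prefer a brand-new cluster, otherwise scan the
 * clusters already assigned to this order, skipping any cluster previously
 * found to have no aligned space left.
 */
static int alloc_slots(int order, int *cluster_out)
{
        unsigned int nr = 1u << order;
        int idx, base;

        for (idx = 0; idx < NR_CLUSTERS; idx++) {
                if (clusters[idx].order == ORDER_FREE) {
                        clusters[idx].order = order;
                        memset(clusters[idx].map, 1, nr);       /* take [0, nr) */
                        *cluster_out = idx;
                        return 0;
                }
        }

        for (idx = 0; idx < NR_CLUSTERS; idx++) {
                struct cluster *c = &clusters[idx];

                if (c->order != order || c->full_for_order)
                        continue;
                base = scan_cluster(c, nr);
                if (base >= 0) {
                        memset(&c->map[base], 1, nr);
                        *cluster_out = idx;
                        return base;
                }
                c->full_for_order = true;       /* don't rescan this one */
        }
        return -1;      /* caller would fall back (e.g. to order-0) */
}

int main(void)
{
        int i, cl, base;

        for (i = 0; i < NR_CLUSTERS; i++)
                clusters[i].order = ORDER_FREE;

        for (i = 0; i < 12; i++) {
                cl = -1;
                base = alloc_slots(2, &cl);
                printf("order-2 alloc %2d -> cluster %d, slot %d\n", i, cl, base);
        }
        return 0;
}

The full_for_order flag plays the same role as the cluster flags described in the cover letter: once a cluster has been scanned and found to hold no aligned run for its order, later allocations skip it, which keeps the scan cost roughly comparable to maintaining per-order lists.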
On 19/06/2024 08:19, Huang, Ying wrote: > Hi, Ryan, > > Ryan Roberts <ryan.roberts@arm.com> writes: > >> Hi All, >> >> Chris has been doing great work at [1] to clean up my mess in the mTHP swap >> entry allocator. > > I don't think the original behavior is something like mess. It's just > the first step in the correct direction. It's straightforward and > obviously correctly. Then, we can optimize it step by step with data to > justify the increased complexity. OK, perhaps I was over-egging it by calling it a "mess". What you're describing was my initial opinion too, but I saw Andrew complaining that we shouldn't be merging a feature if it doesn't work. This series fixes the problem in a minimal way - if you ignore the last patch, which is really is just a performance optimization and could be dropped. If we can ultimately get Chris's series to 0% fallback like this one, and everyone is happy with the current state for v6.10, then agreed - let's concentrate on Chris's series for v6.11. Thanks, Ryan > >> But Barry posted a test program and results at [2] showing that >> even with Chris's changes, there are still some fallbacks (around 5% - 25% in >> some cases). I was interested in why that might be and ended up putting this PoC >> patch set together to try to get a better understanding. This series ends up >> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done >> much testing beyond that (yet) but thought it was worth posting on the strength >> of that result alone. >> >> At a high level this works in a similar way to Chris's series; it marks a >> cluster as being for a particular order and if a new cluster cannot be allocated >> then it scans through the existing non-full clusters. But it does it by scanning >> through the clusters rather than assembling them into a list. Cluster flags are >> used to mark clusters that have been scanned and are known not to have enough >> contiguous space, so the efficiency should be similar in practice. >> >> Because its not based around a linked list, there is less churn and I'm >> wondering if this is perhaps easier to review and potentially even get into >> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 >> for Chris's series? I know Chris has a larger roadmap of improvements, so at >> best I see this as a tactical fix that will ultimately be superseeded by Chris's >> work. > > I don't think we need any mTHP swap entry allocation optimization to go > into v6.10-rcX. There's no functionality or performance regression. > Per my understanding, we merge optimization when it's ready. > > Hi, Andrew, > > Please correct me if you don't agree. > > [snip] > > -- > Best Regards, > Huang, Ying
On 19/06/2024 10:11, Barry Song wrote: > On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Hi All, >> >> Chris has been doing great work at [1] to clean up my mess in the mTHP swap >> entry allocator. But Barry posted a test program and results at [2] showing that >> even with Chris's changes, there are still some fallbacks (around 5% - 25% in >> some cases). I was interested in why that might be and ended up putting this PoC >> patch set together to try to get a better understanding. This series ends up >> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done >> much testing beyond that (yet) but thought it was worth posting on the strength >> of that result alone. >> >> At a high level this works in a similar way to Chris's series; it marks a >> cluster as being for a particular order and if a new cluster cannot be allocated >> then it scans through the existing non-full clusters. But it does it by scanning >> through the clusters rather than assembling them into a list. Cluster flags are >> used to mark clusters that have been scanned and are known not to have enough >> contiguous space, so the efficiency should be similar in practice. >> >> Because its not based around a linked list, there is less churn and I'm >> wondering if this is perhaps easier to review and potentially even get into >> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 >> for Chris's series? I know Chris has a larger roadmap of improvements, so at >> best I see this as a tactical fix that will ultimately be superseeded by Chris's >> work. >> >> There are a few differences to note vs Chris's series: >> >> - order-0 fallback scanning is still allowed in any cluster; the argument in the >> past was that swap should always use all the swap space, so I've left this >> mechanism in. It is only a fallback though; first the the new per-order >> scanner is invoked, even for order-0, so if there are free slots in clusters >> already assigned for order-0, then the allocation will go there. >> >> - CPUs can steal slots from other CPU's current clusters; those clusters remain >> scannable while they are current for a CPU and are only made unscannable when >> no more CPUs are scanning that particular cluster. >> >> - I'm preferring to allocate a free cluster ahead of per-order scanning, since, >> as I understand it, the original intent of a per-cpu current cluster was to >> get pages for an application adjacent in the swap to speed up IO. >> >> I'd be keen to hear if you think we could get something like this into v6.10 to >> fix the mess - I'm willing to work quickly to address comments and do more >> testing. If not, then this is probably just a distraction and we should >> concentrate on Chris's series. > > Ryan, thank you very much for accomplishing this. > > I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of > add_to_swap() for both your approach and Chris's. I will update you with > the results ASAP. Ahh great - look forward to the results! > > I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has > caused a couple of crashes. > >> >> This applies on top of v6.10-rc4. 
>> >> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/ >> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ >> >> Thanks, >> Ryan >> >> Ryan Roberts (5): >> mm: swap: Simplify end-of-cluster calculation >> mm: swap: Change SWAP_NEXT_INVALID to highest value >> mm: swap: Track allocation order for clusters >> mm: swap: Scan for free swap entries in allocated clusters >> mm: swap: Optimize per-order cluster scanning >> >> include/linux/swap.h | 18 +++-- >> mm/swapfile.c | 164 ++++++++++++++++++++++++++++++++++++++----- >> 2 files changed, 157 insertions(+), 25 deletions(-) >> >> -- >> 2.43.0 >> > > Thanks > Barry
Ryan Roberts <ryan.roberts@arm.com> writes: > On 19/06/2024 08:19, Huang, Ying wrote: >> Hi, Ryan, >> >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> Hi All, >>> >>> Chris has been doing great work at [1] to clean up my mess in the mTHP swap >>> entry allocator. >> >> I don't think the original behavior is something like mess. It's just >> the first step in the correct direction. It's straightforward and >> obviously correctly. Then, we can optimize it step by step with data to >> justify the increased complexity. > > OK, perhaps I was over-egging it by calling it a "mess". What you're describing > was my initial opinion too, but I saw Andrew complaining that we shouldn't be > merging a feature if it doesn't work. I don't think it doesn't work. It just works well for some workloads, but doesn't work well for some other workloads. We can always improve the implementation to make it works better for more workloads. From another point of view, even with your and Chris's improvement, mTHP swap entry allocation will fallback for quite some workloads. Use page allocator as an example. Even in current kernel, THP allocation may fallback for quite some workloads because of page fragmentation. But I don't think THP doesn't work now. > This series fixes the problem in a minimal > way - if you ignore the last patch, which is really is just a performance > optimization and could be dropped. > > If we can ultimately get Chris's series to 0% fallback like this one, and > everyone is happy with the current state for v6.10, then agreed - let's > concentrate on Chris's series for v6.11. [snip] -- Best Regards, Huang, Ying
On Wed, Jun 19, 2024 at 9:18 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 19/06/2024 10:11, Barry Song wrote: > > On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Hi All, > >> > >> Chris has been doing great work at [1] to clean up my mess in the mTHP swap > >> entry allocator. But Barry posted a test program and results at [2] showing that > >> even with Chris's changes, there are still some fallbacks (around 5% - 25% in > >> some cases). I was interested in why that might be and ended up putting this PoC > >> patch set together to try to get a better understanding. This series ends up > >> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > >> much testing beyond that (yet) but thought it was worth posting on the strength > >> of that result alone. > >> > >> At a high level this works in a similar way to Chris's series; it marks a > >> cluster as being for a particular order and if a new cluster cannot be allocated > >> then it scans through the existing non-full clusters. But it does it by scanning > >> through the clusters rather than assembling them into a list. Cluster flags are > >> used to mark clusters that have been scanned and are known not to have enough > >> contiguous space, so the efficiency should be similar in practice. > >> > >> Because its not based around a linked list, there is less churn and I'm > >> wondering if this is perhaps easier to review and potentially even get into > >> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > >> for Chris's series? I know Chris has a larger roadmap of improvements, so at > >> best I see this as a tactical fix that will ultimately be superseeded by Chris's > >> work. > >> > >> There are a few differences to note vs Chris's series: > >> > >> - order-0 fallback scanning is still allowed in any cluster; the argument in the > >> past was that swap should always use all the swap space, so I've left this > >> mechanism in. It is only a fallback though; first the the new per-order > >> scanner is invoked, even for order-0, so if there are free slots in clusters > >> already assigned for order-0, then the allocation will go there. > >> > >> - CPUs can steal slots from other CPU's current clusters; those clusters remain > >> scannable while they are current for a CPU and are only made unscannable when > >> no more CPUs are scanning that particular cluster. > >> > >> - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > >> as I understand it, the original intent of a per-cpu current cluster was to > >> get pages for an application adjacent in the swap to speed up IO. > >> > >> I'd be keen to hear if you think we could get something like this into v6.10 to > >> fix the mess - I'm willing to work quickly to address comments and do more > >> testing. If not, then this is probably just a distraction and we should > >> concentrate on Chris's series. > > > > Ryan, thank you very much for accomplishing this. > > > > I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of > > add_to_swap() for both your approach and Chris's. I will update you with > > the results ASAP. > > Ahh great - look forward to the results! 
Essentially, we are measuring two types of latency:
* Small folio swap allocation
* Large folio swap allocation

The concept code is like:

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 994723cef821..a608b916ed2f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,10 +185,18 @@ bool add_to_swap(struct folio *folio)
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);

+	start_time = ktime_get();
+
 	entry = folio_alloc_swap(folio);
 	if (!entry.val)
 		return false;

+	end_time = ktime_get();
+	if (folio_test_large(folio))
+		trace_large_swap_allocation_latency(ktime_sub(end_time, start_time));
+	else
+		trace_small_swap_allocation_latency(ktime_sub(end_time, start_time));
+
 	/*
 	 * XArray node allocations from PF_MEMALLOC contexts could
 	 * completely exhaust the page allocator. __GFP_NOMEMALLOC

Then, we'll generate histograms for both large and small allocation latency.
We're currently encountering some setup issues. Once we have the data, I'll
provide updates to you and Chris.

Additionally, I noticed some comments suggesting that Chris's patch might
negatively impact the swap allocation latency of small folios. Perhaps the
data can help clarify this.

>
> >
> > I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has
> > caused a couple of crashes.
> >
> >>
> >> This applies on top of v6.10-rc4.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/
> >> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
> >>
> >> Thanks,
> >> Ryan
> >>
> >> Ryan Roberts (5):
> >>   mm: swap: Simplify end-of-cluster calculation
> >>   mm: swap: Change SWAP_NEXT_INVALID to highest value
> >>   mm: swap: Track allocation order for clusters
> >>   mm: swap: Scan for free swap entries in allocated clusters
> >>   mm: swap: Optimize per-order cluster scanning
> >>
> >>  include/linux/swap.h |  18 +++--
> >>  mm/swapfile.c        | 164 ++++++++++++++++++++++++++++++++++++++-----
> >>  2 files changed, 157 insertions(+), 25 deletions(-)
> >>
> >> --
> >> 2.43.0
> >>
>
Thanks
Barry
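Once deltas like the ones above are being collected, producing the histograms Barry mentions is mostly a matter of bucketing. The snippet below is only an illustration of one possible reduction (power-of-two nanosecond buckets, separate tables for small and large folios); it is not part of Barry's patch, and the function and variable names are invented for the example.

/*
 * Illustration only: one way the per-allocation latency deltas could be
 * reduced to the histograms discussed above. Userspace sketch; bucket
 * layout and names are arbitrary choices for the example.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_BUCKETS 16           /* bucket i roughly covers [2^i, 2^(i+1)) ns */

static unsigned long hist_small[NR_BUCKETS];
static unsigned long hist_large[NR_BUCKETS];

static void account_latency(uint64_t delta_ns, int is_large)
{
        unsigned int bucket = 0;

        while (delta_ns > 1 && bucket < NR_BUCKETS - 1) {
                delta_ns >>= 1;         /* log2 bucketing */
                bucket++;
        }
        if (is_large)
                hist_large[bucket]++;
        else
                hist_small[bucket]++;
}

int main(void)
{
        uint64_t samples[] = { 800, 1500, 3000, 250000, 9000 };
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                account_latency(samples[i], i & 1);

        for (i = 0; i < NR_BUCKETS; i++)
                printf("~%llu ns: small=%lu large=%lu\n",
                       1ULL << i, hist_small[i], hist_large[i]);
        return 0;
}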
On 19/06/2024 00:26, Ryan Roberts wrote: > Hi All, > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > entry allocator. But Barry posted a test program and results at [2] showing that > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > some cases). I was interested in why that might be and ended up putting this PoC > patch set together to try to get a better understanding. This series ends up > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > much testing beyond that (yet) but thought it was worth posting on the strength > of that result alone. > > At a high level this works in a similar way to Chris's series; it marks a > cluster as being for a particular order and if a new cluster cannot be allocated > then it scans through the existing non-full clusters. But it does it by scanning > through the clusters rather than assembling them into a list. Cluster flags are > used to mark clusters that have been scanned and are known not to have enough > contiguous space, so the efficiency should be similar in practice. > > Because its not based around a linked list, there is less churn and I'm > wondering if this is perhaps easier to review and potentially even get into > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > for Chris's series? I know Chris has a larger roadmap of improvements, so at > best I see this as a tactical fix that will ultimately be superseeded by Chris's > work. > > There are a few differences to note vs Chris's series: > > - order-0 fallback scanning is still allowed in any cluster; the argument in the > past was that swap should always use all the swap space, so I've left this > mechanism in. It is only a fallback though; first the the new per-order > scanner is invoked, even for order-0, so if there are free slots in clusters > already assigned for order-0, then the allocation will go there. > > - CPUs can steal slots from other CPU's current clusters; those clusters remain > scannable while they are current for a CPU and are only made unscannable when > no more CPUs are scanning that particular cluster. > > - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > as I understand it, the original intent of a per-cpu current cluster was to > get pages for an application adjacent in the swap to speed up IO. [...] I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allcoation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense). However, I'm running into an unexpected issue; It was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing oder-0 scanner uses. But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ? Here is my code, for reference. Note that "evict" is always set true to be very agressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed). 
With logging added, every single call to __try_to_reclaim_swap() returns 0.
(well it also live locks due to that with code as shown, but I hacked around
that in experiments). Any ideas, before I dig deeper?


diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6d78723f0fd..928d61e1d9d5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 }

 static inline bool swap_range_empty(char *swap_map, unsigned int start,
-                                    unsigned int nr_pages)
+                                    unsigned int nr_pages, bool *inc_cached)
 {
         unsigned int i;
-
-        for (i = 0; i < nr_pages; i++) {
-                if (swap_map[start + i])
-                        return false;
+        bool hit_cache = false;
+
+        if (*inc_cached) {
+                for (i = 0; i < nr_pages; i++) {
+                        if (swap_map[start + i]) {
+                                if (swap_map[start + i] != SWAP_HAS_CACHE)
+                                        return false;
+                                hit_cache = true;
+                        }
+                }
+                *inc_cached = hit_cache;
+        } else {
+                for (i = 0; i < nr_pages; i++) {
+                        if (swap_map[start + i])
+                                return false;
+                }
         }

         return true;
@@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
         struct swap_cluster_info *ci;
         unsigned int tmp, max;
         unsigned int stop = SWAP_NEXT_INVALID;
+        bool evict = true;

 new_cluster:
         cluster = this_cpu_ptr(si->percpu_cluster);
@@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
          * natural alignment.
          */
         max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
-        ci = lock_cluster(si, tmp);
-        while (tmp < max) {
-                if (swap_range_empty(si->swap_map, tmp, nr_pages))
-                        break;
-                tmp += nr_pages;
+scan_cluster:
+        if (tmp < max) {
+                ci = lock_cluster(si, tmp);
+                while (tmp < max) {
+                        if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict))
+                                break;
+                        tmp += nr_pages;
+                }
+                unlock_cluster(ci);
         }
-        unlock_cluster(ci);
         if (tmp >= max) {
                 cluster_dec_scanners(ci);
                 cluster->next[order] = SWAP_NEXT_INVALID;
                 goto new_cluster;
         }
+        if (evict) {
+                unsigned int off;
+                int nr;
+
+                spin_unlock(&si->lock);
+
+                for (off = tmp; off < tmp + nr_pages; off += nr) {
+                        nr = 1;
+                        if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) {
+                                nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY);
+                                if (nr < 0)
+                                        break;
+                                else if (nr == 0)
+                                        nr = 1;
+                                nr = ALIGN(off + 1, nr) - off;
+                        }
+                }
+
+                spin_lock(&si->lock);
+                *scan_base = this_cpu_read(*si->cluster_next_cpu);
+                *offset = *scan_base;
+
+                if (nr < 0)
+                        tmp += nr_pages;
+
+                goto scan_cluster;
+        }
         *offset = tmp;
         *scan_base = tmp;
         tmp += nr_pages;
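As background for the question above: the reasoning relies on the swap_map encoding, where each slot's byte carries a reference count in the low bits plus the SWAP_HAS_CACHE flag, so a value of exactly SWAP_HAS_CACHE means "in the swap cache, otherwise unreferenced", i.e. a reclaim candidate. The userspace sketch below only illustrates that encoding; the helper names and the standalone main() are invented for the example, and the flag value mirrors what include/linux/swap.h defines at the time of writing.

/*
 * Userspace sketch of the swap_map state being tested above. Each slot has
 * one byte: a reference count in the low bits plus a SWAP_HAS_CACHE flag.
 * A value of exactly SWAP_HAS_CACHE means "folio in swap cache, no other
 * references", the state the experiment tries to reclaim.
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAP_HAS_CACHE  0x40    /* folio present in swap cache (illustrative) */

static int swap_count(unsigned char ent)
{
        return ent & ~SWAP_HAS_CACHE;   /* strip the cache flag, keep count */
}

/* Only this state is worth passing to something like __try_to_reclaim_swap(). */
static bool reclaim_candidate(unsigned char ent)
{
        return ent == SWAP_HAS_CACHE;   /* cached, count == 0 */
}

int main(void)
{
        unsigned char examples[] = {
                0x00,                   /* free slot */
                0x01,                   /* one PTE reference, not cached */
                SWAP_HAS_CACHE,         /* cached only: reclaim candidate */
                SWAP_HAS_CACHE | 0x02,  /* cached and still mapped twice */
        };
        unsigned int i;

        for (i = 0; i < sizeof(examples); i++)
                printf("map=0x%02x count=%d cached=%d reclaim_candidate=%d\n",
                       examples[i], swap_count(examples[i]),
                       !!(examples[i] & SWAP_HAS_CACHE),
                       reclaim_candidate(examples[i]));
        return 0;
}

As the rest of the thread goes on to show, the same map value can also appear transiently once an entry has been handed to the batched free path, which is why the reclaim attempts in the experiment above kept returning 0.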
On Tue, Jun 25, 2024 at 1:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 19/06/2024 00:26, Ryan Roberts wrote: > > Hi All, > > > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > > entry allocator. But Barry posted a test program and results at [2] showing that > > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > > some cases). I was interested in why that might be and ended up putting this PoC > > patch set together to try to get a better understanding. This series ends up > > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > > much testing beyond that (yet) but thought it was worth posting on the strength > > of that result alone. > > > > At a high level this works in a similar way to Chris's series; it marks a > > cluster as being for a particular order and if a new cluster cannot be allocated > > then it scans through the existing non-full clusters. But it does it by scanning > > through the clusters rather than assembling them into a list. Cluster flags are > > used to mark clusters that have been scanned and are known not to have enough > > contiguous space, so the efficiency should be similar in practice. > > > > Because its not based around a linked list, there is less churn and I'm > > wondering if this is perhaps easier to review and potentially even get into > > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > > for Chris's series? I know Chris has a larger roadmap of improvements, so at > > best I see this as a tactical fix that will ultimately be superseeded by Chris's > > work. > > > > There are a few differences to note vs Chris's series: > > > > - order-0 fallback scanning is still allowed in any cluster; the argument in the > > past was that swap should always use all the swap space, so I've left this > > mechanism in. It is only a fallback though; first the the new per-order > > scanner is invoked, even for order-0, so if there are free slots in clusters > > already assigned for order-0, then the allocation will go there. > > > > - CPUs can steal slots from other CPU's current clusters; those clusters remain > > scannable while they are current for a CPU and are only made unscannable when > > no more CPUs are scanning that particular cluster. > > > > - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > > as I understand it, the original intent of a per-cpu current cluster was to > > get pages for an application adjacent in the swap to speed up IO. > > [...] > > I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allcoation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense). > > However, I'm running into an unexpected issue; It was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing oder-0 scanner uses. > > But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ? I saw that in my test too. 
Some swap entries remain as SWAP_HAS_CACHE which contribute to the swap allocation failure rate. One possibility is that the zram is hitting the synchronous IO case for swap in, it does not put the folio in swap cache when serving the swap in. In commit 13ddaf26be324a7f951891ecd9ccd04466d27458 from Kairui, the SWAP_HAS_CACHE flag was added for synchronous IO, in order to address a race in the swap in. But the SWAP_HAS_CACHE should be cleaned up after swap in fault though. I did not have a chance to dig deeper into the root cause of those SWAP_HAS_CACHE entries yet. Chris > > Here is my code, for reference. Note that "evict" is always set true to be very agressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed). > > With logging added, every single call to __try_to_reclaim_swap() returns 0. (well it also live locks due to that with code as shown, but I hacked around that in experiments). Any ideas, before I dig deeper? > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index f6d78723f0fd..928d61e1d9d5 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, > } > > static inline bool swap_range_empty(char *swap_map, unsigned int start, > - unsigned int nr_pages) > + unsigned int nr_pages, bool *inc_cached) > { > unsigned int i; > - > - for (i = 0; i < nr_pages; i++) { > - if (swap_map[start + i]) > - return false; > + bool hit_cache = false; > + > + if (*inc_cached) { > + for (i = 0; i < nr_pages; i++) { > + if (swap_map[start + i]) { > + if (swap_map[start + i] != SWAP_HAS_CACHE) > + return false; > + hit_cache = true; > + } > + } > + *inc_cached = hit_cache; > + } else { > + for (i = 0; i < nr_pages; i++) { > + if (swap_map[start + i]) > + return false; > + } > } > > return true; > @@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > struct swap_cluster_info *ci; > unsigned int tmp, max; > unsigned int stop = SWAP_NEXT_INVALID; > + bool evict = true; > > new_cluster: > cluster = this_cpu_ptr(si->percpu_cluster); > @@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > * natural alignment. > */ > max = ALIGN(tmp + 1, SWAPFILE_CLUSTER); > - ci = lock_cluster(si, tmp); > - while (tmp < max) { > - if (swap_range_empty(si->swap_map, tmp, nr_pages)) > - break; > - tmp += nr_pages; > +scan_cluster: > + if (tmp < max) { > + ci = lock_cluster(si, tmp); > + while (tmp < max) { > + if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict)) > + break; > + tmp += nr_pages; > + } > + unlock_cluster(ci); > } > - unlock_cluster(ci); > if (tmp >= max) { > cluster_dec_scanners(ci); > cluster->next[order] = SWAP_NEXT_INVALID; > goto new_cluster; > } > + if (evict) { > + unsigned int off; > + int nr; > + > + spin_unlock(&si->lock); > + > + for (off = tmp; off < tmp + nr_pages; off += nr) { > + nr = 1; > + if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) { > + nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY); > + if (nr < 0) > + break; > + else if (nr == 0) > + nr = 1; > + nr = ALIGN(off + 1, nr) - off; > + } > + } > + > + spin_lock(&si->lock); > + *scan_base = this_cpu_read(*si->cluster_next_cpu); > + *offset = *scan_base; > + > + if (nr < 0) > + tmp += nr_pages; > + > + goto scan_cluster; > + } > *offset = tmp; > *scan_base = tmp; > tmp += nr_pages; > >
On 25/06/2024 17:06, Chris Li wrote: > On Tue, Jun 25, 2024 at 1:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 19/06/2024 00:26, Ryan Roberts wrote: >>> Hi All, >>> >>> Chris has been doing great work at [1] to clean up my mess in the mTHP swap >>> entry allocator. But Barry posted a test program and results at [2] showing that >>> even with Chris's changes, there are still some fallbacks (around 5% - 25% in >>> some cases). I was interested in why that might be and ended up putting this PoC >>> patch set together to try to get a better understanding. This series ends up >>> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done >>> much testing beyond that (yet) but thought it was worth posting on the strength >>> of that result alone. >>> >>> At a high level this works in a similar way to Chris's series; it marks a >>> cluster as being for a particular order and if a new cluster cannot be allocated >>> then it scans through the existing non-full clusters. But it does it by scanning >>> through the clusters rather than assembling them into a list. Cluster flags are >>> used to mark clusters that have been scanned and are known not to have enough >>> contiguous space, so the efficiency should be similar in practice. >>> >>> Because its not based around a linked list, there is less churn and I'm >>> wondering if this is perhaps easier to review and potentially even get into >>> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 >>> for Chris's series? I know Chris has a larger roadmap of improvements, so at >>> best I see this as a tactical fix that will ultimately be superseeded by Chris's >>> work. >>> >>> There are a few differences to note vs Chris's series: >>> >>> - order-0 fallback scanning is still allowed in any cluster; the argument in the >>> past was that swap should always use all the swap space, so I've left this >>> mechanism in. It is only a fallback though; first the the new per-order >>> scanner is invoked, even for order-0, so if there are free slots in clusters >>> already assigned for order-0, then the allocation will go there. >>> >>> - CPUs can steal slots from other CPU's current clusters; those clusters remain >>> scannable while they are current for a CPU and are only made unscannable when >>> no more CPUs are scanning that particular cluster. >>> >>> - I'm preferring to allocate a free cluster ahead of per-order scanning, since, >>> as I understand it, the original intent of a per-cpu current cluster was to >>> get pages for an application adjacent in the swap to speed up IO. >> >> [...] >> >> I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allcoation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense). >> >> However, I'm running into an unexpected issue; It was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing oder-0 scanner uses. >> >> But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ? > > I saw that in my test too. 
Some swap entries remain as SWAP_HAS_CACHE > which contribute to the swap allocation failure rate. > > One possibility is that the zram is hitting the synchronous IO case > for swap in, it does not put the folio in swap cache when serving the > swap in. In commit 13ddaf26be324a7f951891ecd9ccd04466d27458 from > Kairui, the SWAP_HAS_CACHE flag was added for synchronous IO, in order > to address a race in the swap in. But the SWAP_HAS_CACHE should be > cleaned up after swap in fault though. I did not have a chance to dig > deeper into the root cause of those SWAP_HAS_CACHE entries yet. Oh wait, I think they are probably in the swap slot cache waiting to be freed. Once the count goes to 0 (inc SWAP_HAS_CACHE), __swap_entry_free_locked temporarily adds SWAP_HAS_CACHE back in, and free_swap_slot() is called. The entry is buffered in that layer until the buffer is full and they are all flushed together. So should be able to trigger that flush to free some slots. > > Chris > >> >> Here is my code, for reference. Note that "evict" is always set true to be very agressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed). >> >> With logging added, every single call to __try_to_reclaim_swap() returns 0. (well it also live locks due to that with code as shown, but I hacked around that in experiments). Any ideas, before I dig deeper? >> >> >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index f6d78723f0fd..928d61e1d9d5 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, >> } >> >> static inline bool swap_range_empty(char *swap_map, unsigned int start, >> - unsigned int nr_pages) >> + unsigned int nr_pages, bool *inc_cached) >> { >> unsigned int i; >> - >> - for (i = 0; i < nr_pages; i++) { >> - if (swap_map[start + i]) >> - return false; >> + bool hit_cache = false; >> + >> + if (*inc_cached) { >> + for (i = 0; i < nr_pages; i++) { >> + if (swap_map[start + i]) { >> + if (swap_map[start + i] != SWAP_HAS_CACHE) >> + return false; >> + hit_cache = true; >> + } >> + } >> + *inc_cached = hit_cache; >> + } else { >> + for (i = 0; i < nr_pages; i++) { >> + if (swap_map[start + i]) >> + return false; >> + } >> } >> >> return true; >> @@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >> struct swap_cluster_info *ci; >> unsigned int tmp, max; >> unsigned int stop = SWAP_NEXT_INVALID; >> + bool evict = true; >> >> new_cluster: >> cluster = this_cpu_ptr(si->percpu_cluster); >> @@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >> * natural alignment. 
>> */ >> max = ALIGN(tmp + 1, SWAPFILE_CLUSTER); >> - ci = lock_cluster(si, tmp); >> - while (tmp < max) { >> - if (swap_range_empty(si->swap_map, tmp, nr_pages)) >> - break; >> - tmp += nr_pages; >> +scan_cluster: >> + if (tmp < max) { >> + ci = lock_cluster(si, tmp); >> + while (tmp < max) { >> + if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict)) >> + break; >> + tmp += nr_pages; >> + } >> + unlock_cluster(ci); >> } >> - unlock_cluster(ci); >> if (tmp >= max) { >> cluster_dec_scanners(ci); >> cluster->next[order] = SWAP_NEXT_INVALID; >> goto new_cluster; >> } >> + if (evict) { >> + unsigned int off; >> + int nr; >> + >> + spin_unlock(&si->lock); >> + >> + for (off = tmp; off < tmp + nr_pages; off += nr) { >> + nr = 1; >> + if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) { >> + nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY); >> + if (nr < 0) >> + break; >> + else if (nr == 0) >> + nr = 1; >> + nr = ALIGN(off + 1, nr) - off; >> + } >> + } >> + >> + spin_lock(&si->lock); >> + *scan_base = this_cpu_read(*si->cluster_next_cpu); >> + *offset = *scan_base; >> + >> + if (nr < 0) >> + tmp += nr_pages; >> + >> + goto scan_cluster; >> + } >> *offset = tmp; >> *scan_base = tmp; >> tmp += nr_pages; >> >>
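Following Ryan's diagnosis above, the userspace sketch below models why slots can linger with a map value of SWAP_HAS_CACHE even though no folio owns them any more (they are parked in a small batched free cache), and what triggering that flush from the allocator could look like. All names, sizes and the retry-after-flush policy are invented for the illustration; the real machinery is the per-CPU cache in mm/swap_slots.c.

/*
 * Userspace model of the behaviour described above: freed swap entries are
 * parked in a small batched free cache and only really released when the
 * cache is flushed, so a scanner can see SWAP_HAS_CACHE slots that no folio
 * owns any more. The names (slot_cache, flush_slot_cache) and sizes are made
 * up for the illustration.
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAP_HAS_CACHE  0x40
#define NR_SLOTS        16
#define CACHE_BATCH     4       /* flush only once this many frees are queued */

static unsigned char swap_map[NR_SLOTS];
static unsigned int slot_cache[CACHE_BATCH];
static unsigned int slot_cache_nr;

static void flush_slot_cache(void)
{
        unsigned int i;

        for (i = 0; i < slot_cache_nr; i++)
                swap_map[slot_cache[i]] = 0;    /* slot really becomes free */
        slot_cache_nr = 0;
}

/* Model of "count dropped to zero": mark pending and queue for a batched free. */
static void free_slot(unsigned int off)
{
        swap_map[off] = SWAP_HAS_CACHE;         /* looks busy to a scanner */
        slot_cache[slot_cache_nr++] = off;
        if (slot_cache_nr == CACHE_BATCH)
                flush_slot_cache();
}

/* Allocator side: if scanning only finds pending slots, force the flush. */
static int alloc_slot(void)
{
        unsigned int off;
        bool saw_pending = false;

        for (off = 0; off < NR_SLOTS; off++) {
                if (!swap_map[off]) {
                        swap_map[off] = 1;
                        return off;
                }
                if (swap_map[off] == SWAP_HAS_CACHE)
                        saw_pending = true;
        }
        if (saw_pending) {
                flush_slot_cache();     /* reclaim the parked entries ... */
                return alloc_slot();    /* ... and retry once */
        }
        return -1;
}

int main(void)
{
        unsigned int off;

        for (off = 0; off < NR_SLOTS; off++)    /* fill the whole area */
                swap_map[off] = 1;

        /* Two frees: fewer than CACHE_BATCH, so nothing is flushed yet. */
        free_slot(3);
        free_slot(7);

        printf("alloc after batched frees -> slot %d\n", alloc_slot());
        return 0;
}

In the kernel, anything equivalent to flush_slot_cache() would have to cooperate with the existing per-CPU caches and locking in mm/swap_slots.c; the sketch is only meant to show the shape of the "trigger the flush from the allocator" idea.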