Message ID: 20240618232648.4090299-1-ryan.roberts@arm.com
Series: Alternative mTHP swap allocator improvements
Hi, Ryan, Ryan Roberts <ryan.roberts@arm.com> writes: > Hi All, > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > entry allocator. I don't think the original behavior is something like mess. It's just the first step in the correct direction. It's straightforward and obviously correctly. Then, we can optimize it step by step with data to justify the increased complexity. > But Barry posted a test program and results at [2] showing that > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > some cases). I was interested in why that might be and ended up putting this PoC > patch set together to try to get a better understanding. This series ends up > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > much testing beyond that (yet) but thought it was worth posting on the strength > of that result alone. > > At a high level this works in a similar way to Chris's series; it marks a > cluster as being for a particular order and if a new cluster cannot be allocated > then it scans through the existing non-full clusters. But it does it by scanning > through the clusters rather than assembling them into a list. Cluster flags are > used to mark clusters that have been scanned and are known not to have enough > contiguous space, so the efficiency should be similar in practice. > > Because its not based around a linked list, there is less churn and I'm > wondering if this is perhaps easier to review and potentially even get into > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > for Chris's series? I know Chris has a larger roadmap of improvements, so at > best I see this as a tactical fix that will ultimately be superseeded by Chris's > work. I don't think we need any mTHP swap entry allocation optimization to go into v6.10-rcX. There's no functionality or performance regression. Per my understanding, we merge optimization when it's ready. Hi, Andrew, Please correct me if you don't agree. [snip] -- Best Regards, Huang, Ying
On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi All, > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > entry allocator. But Barry posted a test program and results at [2] showing that > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > some cases). I was interested in why that might be and ended up putting this PoC > patch set together to try to get a better understanding. This series ends up > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > much testing beyond that (yet) but thought it was worth posting on the strength > of that result alone. > > At a high level this works in a similar way to Chris's series; it marks a > cluster as being for a particular order and if a new cluster cannot be allocated > then it scans through the existing non-full clusters. But it does it by scanning > through the clusters rather than assembling them into a list. Cluster flags are > used to mark clusters that have been scanned and are known not to have enough > contiguous space, so the efficiency should be similar in practice. > > Because its not based around a linked list, there is less churn and I'm > wondering if this is perhaps easier to review and potentially even get into > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > for Chris's series? I know Chris has a larger roadmap of improvements, so at > best I see this as a tactical fix that will ultimately be superseeded by Chris's > work. > > There are a few differences to note vs Chris's series: > > - order-0 fallback scanning is still allowed in any cluster; the argument in the > past was that swap should always use all the swap space, so I've left this > mechanism in. It is only a fallback though; first the the new per-order > scanner is invoked, even for order-0, so if there are free slots in clusters > already assigned for order-0, then the allocation will go there. > > - CPUs can steal slots from other CPU's current clusters; those clusters remain > scannable while they are current for a CPU and are only made unscannable when > no more CPUs are scanning that particular cluster. > > - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > as I understand it, the original intent of a per-cpu current cluster was to > get pages for an application adjacent in the swap to speed up IO. > > I'd be keen to hear if you think we could get something like this into v6.10 to > fix the mess - I'm willing to work quickly to address comments and do more > testing. If not, then this is probably just a distraction and we should > concentrate on Chris's series. Ryan, thank you very much for accomplishing this. I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of add_to_swap() for both your approach and Chris's. I will update you with the results ASAP. I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has caused a couple of crashes. > > This applies on top of v6.10-rc4. 
> > [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/ > [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ > > Thanks, > Ryan > > Ryan Roberts (5): > mm: swap: Simplify end-of-cluster calculation > mm: swap: Change SWAP_NEXT_INVALID to highest value > mm: swap: Track allocation order for clusters > mm: swap: Scan for free swap entries in allocated clusters > mm: swap: Optimize per-order cluster scanning > > include/linux/swap.h | 18 +++-- > mm/swapfile.c | 164 ++++++++++++++++++++++++++++++++++++++----- > 2 files changed, 157 insertions(+), 25 deletions(-) > > -- > 2.43.0 > Thanks Barry
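The allocation scheme sketched in the cover letter above (assign each cluster to an order, prefer a brand-new cluster, otherwise scan the non-full clusters already assigned to that order and remember which ones have no aligned space left) can be modelled in a few lines of userspace C. The structures and names below (struct cluster, full_for_order, alloc_slots) are invented for the illustration; they are not the kernel's swap_cluster_info fields, flags or locking, and order-0 fallback scanning is left out.

/*
 * Userspace model of the per-order cluster scanning described above.
 * All names and structures here are simplified illustrations, not the
 * kernel's swap_cluster_info, its flags or its locking.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CLUSTER_SLOTS   512     /* slots per cluster, like SWAPFILE_CLUSTER */
#define NR_CLUSTERS     8
#define ORDER_FREE      -1      /* cluster not yet assigned to any order */

struct cluster {
        unsigned char map[CLUSTER_SLOTS];       /* 0 == slot free */
        int order;                              /* order this cluster serves */
        bool full_for_order;                    /* scanned: no aligned run left */
};

static struct cluster clusters[NR_CLUSTERS];

/* Look for a naturally aligned run of nr free slots inside one cluster. */
static int scan_cluster(struct cluster *c, unsigned int nr)
{
        unsigned int base, i;

        for (base = 0; base + nr <= CLUSTER_SLOTS; base += nr) {
                for (i = 0; i < nr; i++)
                        if (c->map[base + i])
                                break;
                if (i == nr)
                        return base;
        }
        return -1;
}

/*
 * Allocate 1 << order slots: prefer a brand-new cluster, otherwise scan the
 * clusters already assigned to this order, skipping any cluster previously
 * found to have no aligned space left.
 */
static int alloc_slots(int order, int *cluster_out)
{
        unsigned int nr = 1u << order;
        int idx, base;

        for (idx = 0; idx < NR_CLUSTERS; idx++) {
                if (clusters[idx].order == ORDER_FREE) {
                        clusters[idx].order = order;
                        memset(clusters[idx].map, 1, nr);       /* take [0, nr) */
                        *cluster_out = idx;
                        return 0;
                }
        }

        for (idx = 0; idx < NR_CLUSTERS; idx++) {
                struct cluster *c = &clusters[idx];

                if (c->order != order || c->full_for_order)
                        continue;
                base = scan_cluster(c, nr);
                if (base >= 0) {
                        memset(&c->map[base], 1, nr);
                        *cluster_out = idx;
                        return base;
                }
                c->full_for_order = true;       /* don't rescan this one */
        }
        return -1;      /* caller would fall back (e.g. to order-0) */
}

int main(void)
{
        int i, cl, base;

        for (i = 0; i < NR_CLUSTERS; i++)
                clusters[i].order = ORDER_FREE;

        for (i = 0; i < 12; i++) {
                cl = -1;
                base = alloc_slots(2, &cl);
                printf("order-2 alloc %2d -> cluster %d, slot %d\n", i, cl, base);
        }
        return 0;
}

The full_for_order flag plays the same role as the cluster flags described in the cover letter: once a cluster has been scanned and found to hold no aligned run for its order, later allocations skip it, which keeps the scan cost roughly comparable to maintaining per-order lists.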
On 19/06/2024 08:19, Huang, Ying wrote: > Hi, Ryan, > > Ryan Roberts <ryan.roberts@arm.com> writes: > >> Hi All, >> >> Chris has been doing great work at [1] to clean up my mess in the mTHP swap >> entry allocator. > > I don't think the original behavior is something like mess. It's just > the first step in the correct direction. It's straightforward and > obviously correctly. Then, we can optimize it step by step with data to > justify the increased complexity. OK, perhaps I was over-egging it by calling it a "mess". What you're describing was my initial opinion too, but I saw Andrew complaining that we shouldn't be merging a feature if it doesn't work. This series fixes the problem in a minimal way - if you ignore the last patch, which is really is just a performance optimization and could be dropped. If we can ultimately get Chris's series to 0% fallback like this one, and everyone is happy with the current state for v6.10, then agreed - let's concentrate on Chris's series for v6.11. Thanks, Ryan > >> But Barry posted a test program and results at [2] showing that >> even with Chris's changes, there are still some fallbacks (around 5% - 25% in >> some cases). I was interested in why that might be and ended up putting this PoC >> patch set together to try to get a better understanding. This series ends up >> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done >> much testing beyond that (yet) but thought it was worth posting on the strength >> of that result alone. >> >> At a high level this works in a similar way to Chris's series; it marks a >> cluster as being for a particular order and if a new cluster cannot be allocated >> then it scans through the existing non-full clusters. But it does it by scanning >> through the clusters rather than assembling them into a list. Cluster flags are >> used to mark clusters that have been scanned and are known not to have enough >> contiguous space, so the efficiency should be similar in practice. >> >> Because its not based around a linked list, there is less churn and I'm >> wondering if this is perhaps easier to review and potentially even get into >> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 >> for Chris's series? I know Chris has a larger roadmap of improvements, so at >> best I see this as a tactical fix that will ultimately be superseeded by Chris's >> work. > > I don't think we need any mTHP swap entry allocation optimization to go > into v6.10-rcX. There's no functionality or performance regression. > Per my understanding, we merge optimization when it's ready. > > Hi, Andrew, > > Please correct me if you don't agree. > > [snip] > > -- > Best Regards, > Huang, Ying
On 19/06/2024 10:11, Barry Song wrote: > On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Hi All, >> >> Chris has been doing great work at [1] to clean up my mess in the mTHP swap >> entry allocator. But Barry posted a test program and results at [2] showing that >> even with Chris's changes, there are still some fallbacks (around 5% - 25% in >> some cases). I was interested in why that might be and ended up putting this PoC >> patch set together to try to get a better understanding. This series ends up >> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done >> much testing beyond that (yet) but thought it was worth posting on the strength >> of that result alone. >> >> At a high level this works in a similar way to Chris's series; it marks a >> cluster as being for a particular order and if a new cluster cannot be allocated >> then it scans through the existing non-full clusters. But it does it by scanning >> through the clusters rather than assembling them into a list. Cluster flags are >> used to mark clusters that have been scanned and are known not to have enough >> contiguous space, so the efficiency should be similar in practice. >> >> Because its not based around a linked list, there is less churn and I'm >> wondering if this is perhaps easier to review and potentially even get into >> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 >> for Chris's series? I know Chris has a larger roadmap of improvements, so at >> best I see this as a tactical fix that will ultimately be superseeded by Chris's >> work. >> >> There are a few differences to note vs Chris's series: >> >> - order-0 fallback scanning is still allowed in any cluster; the argument in the >> past was that swap should always use all the swap space, so I've left this >> mechanism in. It is only a fallback though; first the the new per-order >> scanner is invoked, even for order-0, so if there are free slots in clusters >> already assigned for order-0, then the allocation will go there. >> >> - CPUs can steal slots from other CPU's current clusters; those clusters remain >> scannable while they are current for a CPU and are only made unscannable when >> no more CPUs are scanning that particular cluster. >> >> - I'm preferring to allocate a free cluster ahead of per-order scanning, since, >> as I understand it, the original intent of a per-cpu current cluster was to >> get pages for an application adjacent in the swap to speed up IO. >> >> I'd be keen to hear if you think we could get something like this into v6.10 to >> fix the mess - I'm willing to work quickly to address comments and do more >> testing. If not, then this is probably just a distraction and we should >> concentrate on Chris's series. > > Ryan, thank you very much for accomplishing this. > > I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of > add_to_swap() for both your approach and Chris's. I will update you with > the results ASAP. Ahh great - look forward to the results! > > I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has > caused a couple of crashes. > >> >> This applies on top of v6.10-rc4. 
>> >> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/ >> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ >> >> Thanks, >> Ryan >> >> Ryan Roberts (5): >> mm: swap: Simplify end-of-cluster calculation >> mm: swap: Change SWAP_NEXT_INVALID to highest value >> mm: swap: Track allocation order for clusters >> mm: swap: Scan for free swap entries in allocated clusters >> mm: swap: Optimize per-order cluster scanning >> >> include/linux/swap.h | 18 +++-- >> mm/swapfile.c | 164 ++++++++++++++++++++++++++++++++++++++----- >> 2 files changed, 157 insertions(+), 25 deletions(-) >> >> -- >> 2.43.0 >> > > Thanks > Barry
Ryan Roberts <ryan.roberts@arm.com> writes: > On 19/06/2024 08:19, Huang, Ying wrote: >> Hi, Ryan, >> >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> Hi All, >>> >>> Chris has been doing great work at [1] to clean up my mess in the mTHP swap >>> entry allocator. >> >> I don't think the original behavior is something like mess. It's just >> the first step in the correct direction. It's straightforward and >> obviously correctly. Then, we can optimize it step by step with data to >> justify the increased complexity. > > OK, perhaps I was over-egging it by calling it a "mess". What you're describing > was my initial opinion too, but I saw Andrew complaining that we shouldn't be > merging a feature if it doesn't work. I don't think it doesn't work. It just works well for some workloads, but doesn't work well for some other workloads. We can always improve the implementation to make it works better for more workloads. From another point of view, even with your and Chris's improvement, mTHP swap entry allocation will fallback for quite some workloads. Use page allocator as an example. Even in current kernel, THP allocation may fallback for quite some workloads because of page fragmentation. But I don't think THP doesn't work now. > This series fixes the problem in a minimal > way - if you ignore the last patch, which is really is just a performance > optimization and could be dropped. > > If we can ultimately get Chris's series to 0% fallback like this one, and > everyone is happy with the current state for v6.10, then agreed - let's > concentrate on Chris's series for v6.11. [snip] -- Best Regards, Huang, Ying
On Wed, Jun 19, 2024 at 9:18 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 19/06/2024 10:11, Barry Song wrote: > > On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Hi All, > >> > >> Chris has been doing great work at [1] to clean up my mess in the mTHP swap > >> entry allocator. But Barry posted a test program and results at [2] showing that > >> even with Chris's changes, there are still some fallbacks (around 5% - 25% in > >> some cases). I was interested in why that might be and ended up putting this PoC > >> patch set together to try to get a better understanding. This series ends up > >> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > >> much testing beyond that (yet) but thought it was worth posting on the strength > >> of that result alone. > >> > >> At a high level this works in a similar way to Chris's series; it marks a > >> cluster as being for a particular order and if a new cluster cannot be allocated > >> then it scans through the existing non-full clusters. But it does it by scanning > >> through the clusters rather than assembling them into a list. Cluster flags are > >> used to mark clusters that have been scanned and are known not to have enough > >> contiguous space, so the efficiency should be similar in practice. > >> > >> Because its not based around a linked list, there is less churn and I'm > >> wondering if this is perhaps easier to review and potentially even get into > >> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > >> for Chris's series? I know Chris has a larger roadmap of improvements, so at > >> best I see this as a tactical fix that will ultimately be superseeded by Chris's > >> work. > >> > >> There are a few differences to note vs Chris's series: > >> > >> - order-0 fallback scanning is still allowed in any cluster; the argument in the > >> past was that swap should always use all the swap space, so I've left this > >> mechanism in. It is only a fallback though; first the the new per-order > >> scanner is invoked, even for order-0, so if there are free slots in clusters > >> already assigned for order-0, then the allocation will go there. > >> > >> - CPUs can steal slots from other CPU's current clusters; those clusters remain > >> scannable while they are current for a CPU and are only made unscannable when > >> no more CPUs are scanning that particular cluster. > >> > >> - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > >> as I understand it, the original intent of a per-cpu current cluster was to > >> get pages for an application adjacent in the swap to speed up IO. > >> > >> I'd be keen to hear if you think we could get something like this into v6.10 to > >> fix the mess - I'm willing to work quickly to address comments and do more > >> testing. If not, then this is probably just a distraction and we should > >> concentrate on Chris's series. > > > > Ryan, thank you very much for accomplishing this. > > > > I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of > > add_to_swap() for both your approach and Chris's. I will update you with > > the results ASAP. > > Ahh great - look forward to the results! 
Essentially, we are measuring two types of latency:
* Small folio swap allocation
* Large folio swap allocation

The concept code is like:

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 994723cef821..a608b916ed2f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,10 +185,18 @@ bool add_to_swap(struct folio *folio)
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);

+	start_time = ktime_get();
+
 	entry = folio_alloc_swap(folio);
 	if (!entry.val)
 		return false;

+	end_time = ktime_get();
+	if (folio_test_large(folio))
+		trace_large_swap_allocation_latency(ktime_sub(end_time, start_time));
+	else
+		trace_small_swap_allocation_latency(ktime_sub(end_time, start_time));
+
 	/*
 	 * XArray node allocations from PF_MEMALLOC contexts could
 	 * completely exhaust the page allocator. __GFP_NOMEMALLOC

Then, we'll generate histograms for both large and small allocation latency.
We're currently encountering some setup issues. Once we have the data, I'll
provide updates to you and Chris.

Additionally, I noticed some comments suggesting that Chris's patch might
negatively impact the swap allocation latency of small folios. Perhaps the
data can help clarify this.

>
> >
> > I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has
> > caused a couple of crashes.
> >
> >>
> >> This applies on top of v6.10-rc4.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/
> >> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
> >>
> >> Thanks,
> >> Ryan
> >>
> >> Ryan Roberts (5):
> >>   mm: swap: Simplify end-of-cluster calculation
> >>   mm: swap: Change SWAP_NEXT_INVALID to highest value
> >>   mm: swap: Track allocation order for clusters
> >>   mm: swap: Scan for free swap entries in allocated clusters
> >>   mm: swap: Optimize per-order cluster scanning
> >>
> >>  include/linux/swap.h |  18 +++--
> >>  mm/swapfile.c        | 164 ++++++++++++++++++++++++++++++++++++++-----
> >>  2 files changed, 157 insertions(+), 25 deletions(-)
> >>
> >> --
> >> 2.43.0
> >>
>
Thanks
Barry
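Once deltas like the ones above are being collected, producing the histograms Barry mentions is mostly a matter of bucketing. The snippet below is only an illustration of one possible reduction (power-of-two nanosecond buckets, separate tables for small and large folios); it is not part of Barry's patch, and the function and variable names are invented for the example.

/*
 * Illustration only: one way the per-allocation latency deltas could be
 * reduced to the histograms discussed above. Userspace sketch; bucket
 * layout and names are arbitrary choices for the example.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_BUCKETS 16           /* bucket i roughly covers [2^i, 2^(i+1)) ns */

static unsigned long hist_small[NR_BUCKETS];
static unsigned long hist_large[NR_BUCKETS];

static void account_latency(uint64_t delta_ns, int is_large)
{
        unsigned int bucket = 0;

        while (delta_ns > 1 && bucket < NR_BUCKETS - 1) {
                delta_ns >>= 1;         /* log2 bucketing */
                bucket++;
        }
        if (is_large)
                hist_large[bucket]++;
        else
                hist_small[bucket]++;
}

int main(void)
{
        uint64_t samples[] = { 800, 1500, 3000, 250000, 9000 };
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                account_latency(samples[i], i & 1);

        for (i = 0; i < NR_BUCKETS; i++)
                printf("~%llu ns: small=%lu large=%lu\n",
                       1ULL << i, hist_small[i], hist_large[i]);
        return 0;
}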
On 19/06/2024 00:26, Ryan Roberts wrote: > Hi All, > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > entry allocator. But Barry posted a test program and results at [2] showing that > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > some cases). I was interested in why that might be and ended up putting this PoC > patch set together to try to get a better understanding. This series ends up > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > much testing beyond that (yet) but thought it was worth posting on the strength > of that result alone. > > At a high level this works in a similar way to Chris's series; it marks a > cluster as being for a particular order and if a new cluster cannot be allocated > then it scans through the existing non-full clusters. But it does it by scanning > through the clusters rather than assembling them into a list. Cluster flags are > used to mark clusters that have been scanned and are known not to have enough > contiguous space, so the efficiency should be similar in practice. > > Because its not based around a linked list, there is less churn and I'm > wondering if this is perhaps easier to review and potentially even get into > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > for Chris's series? I know Chris has a larger roadmap of improvements, so at > best I see this as a tactical fix that will ultimately be superseeded by Chris's > work. > > There are a few differences to note vs Chris's series: > > - order-0 fallback scanning is still allowed in any cluster; the argument in the > past was that swap should always use all the swap space, so I've left this > mechanism in. It is only a fallback though; first the the new per-order > scanner is invoked, even for order-0, so if there are free slots in clusters > already assigned for order-0, then the allocation will go there. > > - CPUs can steal slots from other CPU's current clusters; those clusters remain > scannable while they are current for a CPU and are only made unscannable when > no more CPUs are scanning that particular cluster. > > - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > as I understand it, the original intent of a per-cpu current cluster was to > get pages for an application adjacent in the swap to speed up IO. [...] I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allcoation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense). However, I'm running into an unexpected issue; It was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing oder-0 scanner uses. But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ? Here is my code, for reference. Note that "evict" is always set true to be very agressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed). 
With logging added, every single call to __try_to_reclaim_swap() returns 0.
(well it also live locks due to that with code as shown, but I hacked around
that in experiments). Any ideas, before I dig deeper?


diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6d78723f0fd..928d61e1d9d5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 }

 static inline bool swap_range_empty(char *swap_map, unsigned int start,
-                                    unsigned int nr_pages)
+                                    unsigned int nr_pages, bool *inc_cached)
 {
         unsigned int i;
-
-        for (i = 0; i < nr_pages; i++) {
-                if (swap_map[start + i])
-                        return false;
+        bool hit_cache = false;
+
+        if (*inc_cached) {
+                for (i = 0; i < nr_pages; i++) {
+                        if (swap_map[start + i]) {
+                                if (swap_map[start + i] != SWAP_HAS_CACHE)
+                                        return false;
+                                hit_cache = true;
+                        }
+                }
+                *inc_cached = hit_cache;
+        } else {
+                for (i = 0; i < nr_pages; i++) {
+                        if (swap_map[start + i])
+                                return false;
+                }
         }

         return true;
@@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
         struct swap_cluster_info *ci;
         unsigned int tmp, max;
         unsigned int stop = SWAP_NEXT_INVALID;
+        bool evict = true;

 new_cluster:
         cluster = this_cpu_ptr(si->percpu_cluster);
@@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
          * natural alignment.
          */
         max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
-        ci = lock_cluster(si, tmp);
-        while (tmp < max) {
-                if (swap_range_empty(si->swap_map, tmp, nr_pages))
-                        break;
-                tmp += nr_pages;
+scan_cluster:
+        if (tmp < max) {
+                ci = lock_cluster(si, tmp);
+                while (tmp < max) {
+                        if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict))
+                                break;
+                        tmp += nr_pages;
+                }
+                unlock_cluster(ci);
         }
-        unlock_cluster(ci);
         if (tmp >= max) {
                 cluster_dec_scanners(ci);
                 cluster->next[order] = SWAP_NEXT_INVALID;
                 goto new_cluster;
         }
+        if (evict) {
+                unsigned int off;
+                int nr;
+
+                spin_unlock(&si->lock);
+
+                for (off = tmp; off < tmp + nr_pages; off += nr) {
+                        nr = 1;
+                        if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) {
+                                nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY);
+                                if (nr < 0)
+                                        break;
+                                else if (nr == 0)
+                                        nr = 1;
+                                nr = ALIGN(off + 1, nr) - off;
+                        }
+                }
+
+                spin_lock(&si->lock);
+                *scan_base = this_cpu_read(*si->cluster_next_cpu);
+                *offset = *scan_base;
+
+                if (nr < 0)
+                        tmp += nr_pages;
+
+                goto scan_cluster;
+        }
         *offset = tmp;
         *scan_base = tmp;
         tmp += nr_pages;
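As background for the question above: the reasoning relies on the swap_map encoding, where each slot's byte carries a reference count in the low bits plus the SWAP_HAS_CACHE flag, so a value of exactly SWAP_HAS_CACHE means "in the swap cache, otherwise unreferenced", i.e. a reclaim candidate. The userspace sketch below only illustrates that encoding; the helper names and the standalone main() are invented for the example, and the flag value mirrors what include/linux/swap.h defines at the time of writing.

/*
 * Userspace sketch of the swap_map state being tested above. Each slot has
 * one byte: a reference count in the low bits plus a SWAP_HAS_CACHE flag.
 * A value of exactly SWAP_HAS_CACHE means "folio in swap cache, no other
 * references", the state the experiment tries to reclaim.
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAP_HAS_CACHE  0x40    /* folio present in swap cache (illustrative) */

static int swap_count(unsigned char ent)
{
        return ent & ~SWAP_HAS_CACHE;   /* strip the cache flag, keep count */
}

/* Only this state is worth passing to something like __try_to_reclaim_swap(). */
static bool reclaim_candidate(unsigned char ent)
{
        return ent == SWAP_HAS_CACHE;   /* cached, count == 0 */
}

int main(void)
{
        unsigned char examples[] = {
                0x00,                   /* free slot */
                0x01,                   /* one PTE reference, not cached */
                SWAP_HAS_CACHE,         /* cached only: reclaim candidate */
                SWAP_HAS_CACHE | 0x02,  /* cached and still mapped twice */
        };
        unsigned int i;

        for (i = 0; i < sizeof(examples); i++)
                printf("map=0x%02x count=%d cached=%d reclaim_candidate=%d\n",
                       examples[i], swap_count(examples[i]),
                       !!(examples[i] & SWAP_HAS_CACHE),
                       reclaim_candidate(examples[i]));
        return 0;
}

As the rest of the thread goes on to show, the same map value can also appear transiently once an entry has been handed to the batched free path, which is why the reclaim attempts in the experiment above kept returning 0.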
On Tue, Jun 25, 2024 at 1:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 19/06/2024 00:26, Ryan Roberts wrote: > > Hi All, > > > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > > entry allocator. But Barry posted a test program and results at [2] showing that > > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > > some cases). I was interested in why that might be and ended up putting this PoC > > patch set together to try to get a better understanding. This series ends up > > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > > much testing beyond that (yet) but thought it was worth posting on the strength > > of that result alone. > > > > At a high level this works in a similar way to Chris's series; it marks a > > cluster as being for a particular order and if a new cluster cannot be allocated > > then it scans through the existing non-full clusters. But it does it by scanning > > through the clusters rather than assembling them into a list. Cluster flags are > > used to mark clusters that have been scanned and are known not to have enough > > contiguous space, so the efficiency should be similar in practice. > > > > Because its not based around a linked list, there is less churn and I'm > > wondering if this is perhaps easier to review and potentially even get into > > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > > for Chris's series? I know Chris has a larger roadmap of improvements, so at > > best I see this as a tactical fix that will ultimately be superseeded by Chris's > > work. > > > > There are a few differences to note vs Chris's series: > > > > - order-0 fallback scanning is still allowed in any cluster; the argument in the > > past was that swap should always use all the swap space, so I've left this > > mechanism in. It is only a fallback though; first the the new per-order > > scanner is invoked, even for order-0, so if there are free slots in clusters > > already assigned for order-0, then the allocation will go there. > > > > - CPUs can steal slots from other CPU's current clusters; those clusters remain > > scannable while they are current for a CPU and are only made unscannable when > > no more CPUs are scanning that particular cluster. > > > > - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > > as I understand it, the original intent of a per-cpu current cluster was to > > get pages for an application adjacent in the swap to speed up IO. > > [...] > > I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allcoation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense). > > However, I'm running into an unexpected issue; It was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing oder-0 scanner uses. > > But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ? I saw that in my test too. 
Some swap entries remain as SWAP_HAS_CACHE which contribute to the swap allocation failure rate. One possibility is that the zram is hitting the synchronous IO case for swap in, it does not put the folio in swap cache when serving the swap in. In commit 13ddaf26be324a7f951891ecd9ccd04466d27458 from Kairui, the SWAP_HAS_CACHE flag was added for synchronous IO, in order to address a race in the swap in. But the SWAP_HAS_CACHE should be cleaned up after swap in fault though. I did not have a chance to dig deeper into the root cause of those SWAP_HAS_CACHE entries yet. Chris > > Here is my code, for reference. Note that "evict" is always set true to be very agressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed). > > With logging added, every single call to __try_to_reclaim_swap() returns 0. (well it also live locks due to that with code as shown, but I hacked around that in experiments). Any ideas, before I dig deeper? > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index f6d78723f0fd..928d61e1d9d5 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, > } > > static inline bool swap_range_empty(char *swap_map, unsigned int start, > - unsigned int nr_pages) > + unsigned int nr_pages, bool *inc_cached) > { > unsigned int i; > - > - for (i = 0; i < nr_pages; i++) { > - if (swap_map[start + i]) > - return false; > + bool hit_cache = false; > + > + if (*inc_cached) { > + for (i = 0; i < nr_pages; i++) { > + if (swap_map[start + i]) { > + if (swap_map[start + i] != SWAP_HAS_CACHE) > + return false; > + hit_cache = true; > + } > + } > + *inc_cached = hit_cache; > + } else { > + for (i = 0; i < nr_pages; i++) { > + if (swap_map[start + i]) > + return false; > + } > } > > return true; > @@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > struct swap_cluster_info *ci; > unsigned int tmp, max; > unsigned int stop = SWAP_NEXT_INVALID; > + bool evict = true; > > new_cluster: > cluster = this_cpu_ptr(si->percpu_cluster); > @@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > * natural alignment. > */ > max = ALIGN(tmp + 1, SWAPFILE_CLUSTER); > - ci = lock_cluster(si, tmp); > - while (tmp < max) { > - if (swap_range_empty(si->swap_map, tmp, nr_pages)) > - break; > - tmp += nr_pages; > +scan_cluster: > + if (tmp < max) { > + ci = lock_cluster(si, tmp); > + while (tmp < max) { > + if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict)) > + break; > + tmp += nr_pages; > + } > + unlock_cluster(ci); > } > - unlock_cluster(ci); > if (tmp >= max) { > cluster_dec_scanners(ci); > cluster->next[order] = SWAP_NEXT_INVALID; > goto new_cluster; > } > + if (evict) { > + unsigned int off; > + int nr; > + > + spin_unlock(&si->lock); > + > + for (off = tmp; off < tmp + nr_pages; off += nr) { > + nr = 1; > + if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) { > + nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY); > + if (nr < 0) > + break; > + else if (nr == 0) > + nr = 1; > + nr = ALIGN(off + 1, nr) - off; > + } > + } > + > + spin_lock(&si->lock); > + *scan_base = this_cpu_read(*si->cluster_next_cpu); > + *offset = *scan_base; > + > + if (nr < 0) > + tmp += nr_pages; > + > + goto scan_cluster; > + } > *offset = tmp; > *scan_base = tmp; > tmp += nr_pages; > >
On 25/06/2024 17:06, Chris Li wrote: > On Tue, Jun 25, 2024 at 1:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 19/06/2024 00:26, Ryan Roberts wrote: >>> Hi All, >>> >>> Chris has been doing great work at [1] to clean up my mess in the mTHP swap >>> entry allocator. But Barry posted a test program and results at [2] showing that >>> even with Chris's changes, there are still some fallbacks (around 5% - 25% in >>> some cases). I was interested in why that might be and ended up putting this PoC >>> patch set together to try to get a better understanding. This series ends up >>> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done >>> much testing beyond that (yet) but thought it was worth posting on the strength >>> of that result alone. >>> >>> At a high level this works in a similar way to Chris's series; it marks a >>> cluster as being for a particular order and if a new cluster cannot be allocated >>> then it scans through the existing non-full clusters. But it does it by scanning >>> through the clusters rather than assembling them into a list. Cluster flags are >>> used to mark clusters that have been scanned and are known not to have enough >>> contiguous space, so the efficiency should be similar in practice. >>> >>> Because its not based around a linked list, there is less churn and I'm >>> wondering if this is perhaps easier to review and potentially even get into >>> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 >>> for Chris's series? I know Chris has a larger roadmap of improvements, so at >>> best I see this as a tactical fix that will ultimately be superseeded by Chris's >>> work. >>> >>> There are a few differences to note vs Chris's series: >>> >>> - order-0 fallback scanning is still allowed in any cluster; the argument in the >>> past was that swap should always use all the swap space, so I've left this >>> mechanism in. It is only a fallback though; first the the new per-order >>> scanner is invoked, even for order-0, so if there are free slots in clusters >>> already assigned for order-0, then the allocation will go there. >>> >>> - CPUs can steal slots from other CPU's current clusters; those clusters remain >>> scannable while they are current for a CPU and are only made unscannable when >>> no more CPUs are scanning that particular cluster. >>> >>> - I'm preferring to allocate a free cluster ahead of per-order scanning, since, >>> as I understand it, the original intent of a per-cpu current cluster was to >>> get pages for an application adjacent in the swap to speed up IO. >> >> [...] >> >> I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allcoation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense). >> >> However, I'm running into an unexpected issue; It was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing oder-0 scanner uses. >> >> But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ? > > I saw that in my test too. 
Some swap entries remain as SWAP_HAS_CACHE > which contribute to the swap allocation failure rate. > > One possibility is that the zram is hitting the synchronous IO case > for swap in, it does not put the folio in swap cache when serving the > swap in. In commit 13ddaf26be324a7f951891ecd9ccd04466d27458 from > Kairui, the SWAP_HAS_CACHE flag was added for synchronous IO, in order > to address a race in the swap in. But the SWAP_HAS_CACHE should be > cleaned up after swap in fault though. I did not have a chance to dig > deeper into the root cause of those SWAP_HAS_CACHE entries yet. Oh wait, I think they are probably in the swap slot cache waiting to be freed. Once the count goes to 0 (inc SWAP_HAS_CACHE), __swap_entry_free_locked temporarily adds SWAP_HAS_CACHE back in, and free_swap_slot() is called. The entry is buffered in that layer until the buffer is full and they are all flushed together. So should be able to trigger that flush to free some slots. > > Chris > >> >> Here is my code, for reference. Note that "evict" is always set true to be very agressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed). >> >> With logging added, every single call to __try_to_reclaim_swap() returns 0. (well it also live locks due to that with code as shown, but I hacked around that in experiments). Any ideas, before I dig deeper? >> >> >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index f6d78723f0fd..928d61e1d9d5 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, >> } >> >> static inline bool swap_range_empty(char *swap_map, unsigned int start, >> - unsigned int nr_pages) >> + unsigned int nr_pages, bool *inc_cached) >> { >> unsigned int i; >> - >> - for (i = 0; i < nr_pages; i++) { >> - if (swap_map[start + i]) >> - return false; >> + bool hit_cache = false; >> + >> + if (*inc_cached) { >> + for (i = 0; i < nr_pages; i++) { >> + if (swap_map[start + i]) { >> + if (swap_map[start + i] != SWAP_HAS_CACHE) >> + return false; >> + hit_cache = true; >> + } >> + } >> + *inc_cached = hit_cache; >> + } else { >> + for (i = 0; i < nr_pages; i++) { >> + if (swap_map[start + i]) >> + return false; >> + } >> } >> >> return true; >> @@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >> struct swap_cluster_info *ci; >> unsigned int tmp, max; >> unsigned int stop = SWAP_NEXT_INVALID; >> + bool evict = true; >> >> new_cluster: >> cluster = this_cpu_ptr(si->percpu_cluster); >> @@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >> * natural alignment. 
>> */ >> max = ALIGN(tmp + 1, SWAPFILE_CLUSTER); >> - ci = lock_cluster(si, tmp); >> - while (tmp < max) { >> - if (swap_range_empty(si->swap_map, tmp, nr_pages)) >> - break; >> - tmp += nr_pages; >> +scan_cluster: >> + if (tmp < max) { >> + ci = lock_cluster(si, tmp); >> + while (tmp < max) { >> + if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict)) >> + break; >> + tmp += nr_pages; >> + } >> + unlock_cluster(ci); >> } >> - unlock_cluster(ci); >> if (tmp >= max) { >> cluster_dec_scanners(ci); >> cluster->next[order] = SWAP_NEXT_INVALID; >> goto new_cluster; >> } >> + if (evict) { >> + unsigned int off; >> + int nr; >> + >> + spin_unlock(&si->lock); >> + >> + for (off = tmp; off < tmp + nr_pages; off += nr) { >> + nr = 1; >> + if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) { >> + nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY); >> + if (nr < 0) >> + break; >> + else if (nr == 0) >> + nr = 1; >> + nr = ALIGN(off + 1, nr) - off; >> + } >> + } >> + >> + spin_lock(&si->lock); >> + *scan_base = this_cpu_read(*si->cluster_next_cpu); >> + *offset = *scan_base; >> + >> + if (nr < 0) >> + tmp += nr_pages; >> + >> + goto scan_cluster; >> + } >> *offset = tmp; >> *scan_base = tmp; >> tmp += nr_pages; >> >>
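Following Ryan's diagnosis above, the userspace sketch below models why slots can linger with a map value of SWAP_HAS_CACHE even though no folio owns them any more (they are parked in a small batched free cache), and what triggering that flush from the allocator could look like. All names, sizes and the retry-after-flush policy are invented for the illustration; the real machinery is the per-CPU cache in mm/swap_slots.c.

/*
 * Userspace model of the behaviour described above: freed swap entries are
 * parked in a small batched free cache and only really released when the
 * cache is flushed, so a scanner can see SWAP_HAS_CACHE slots that no folio
 * owns any more. The names (slot_cache, flush_slot_cache) and sizes are made
 * up for the illustration.
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAP_HAS_CACHE  0x40
#define NR_SLOTS        16
#define CACHE_BATCH     4       /* flush only once this many frees are queued */

static unsigned char swap_map[NR_SLOTS];
static unsigned int slot_cache[CACHE_BATCH];
static unsigned int slot_cache_nr;

static void flush_slot_cache(void)
{
        unsigned int i;

        for (i = 0; i < slot_cache_nr; i++)
                swap_map[slot_cache[i]] = 0;    /* slot really becomes free */
        slot_cache_nr = 0;
}

/* Model of "count dropped to zero": mark pending and queue for a batched free. */
static void free_slot(unsigned int off)
{
        swap_map[off] = SWAP_HAS_CACHE;         /* looks busy to a scanner */
        slot_cache[slot_cache_nr++] = off;
        if (slot_cache_nr == CACHE_BATCH)
                flush_slot_cache();
}

/* Allocator side: if scanning only finds pending slots, force the flush. */
static int alloc_slot(void)
{
        unsigned int off;
        bool saw_pending = false;

        for (off = 0; off < NR_SLOTS; off++) {
                if (!swap_map[off]) {
                        swap_map[off] = 1;
                        return off;
                }
                if (swap_map[off] == SWAP_HAS_CACHE)
                        saw_pending = true;
        }
        if (saw_pending) {
                flush_slot_cache();     /* reclaim the parked entries ... */
                return alloc_slot();    /* ... and retry once */
        }
        return -1;
}

int main(void)
{
        unsigned int off;

        for (off = 0; off < NR_SLOTS; off++)    /* fill the whole area */
                swap_map[off] = 1;

        /* Two frees: fewer than CACHE_BATCH, so nothing is flushed yet. */
        free_slot(3);
        free_slot(7);

        printf("alloc after batched frees -> slot %d\n", alloc_slot());
        return 0;
}

In the kernel, anything equivalent to flush_slot_cache() would have to cooperate with the existing per-CPU caches and locking in mm/swap_slots.c; the sketch is only meant to show the shape of the "trigger the flush from the allocator" idea.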