Message ID | 20240619-swap-allocator-v3-0-e973a3102444@kernel.org (mailing list archive) |
---|---|
Series | mm: swap: mTHP swap allocator base on swap cluster order |
On 19/06/2024 10:20, Chris Li wrote:
> This is the short term solution "swap cluster order" listed
> in my "Swap Abstraction" discussion slide 8 in the recent
> LSF/MM conference.
>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" was introduced, it only allocated the mTHP swap entries
> from the new empty cluster list. It has a fragmentation issue
> reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The reason is that all the empty clusters have been exhausted while
> there are plenty of free swap entries in the clusters that are
> not 100% free.
>
> Remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.
>
> User impact: For users that allocate and free mixed order mTHP swapping,
> it greatly improves the success rate of the mTHP swap allocation after the
> initial phase.
>
> Barry provides a test program to show the effect:
> https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
>
> Without:
> $ mthp-swapout
> Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54%
> Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
> Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
> Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00%
> Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
> Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00%
>
> $ mthp-swapout -s
> Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65%
> Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
> Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
> Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00%
> Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
>
> With:
> $ mthp-swapout
> Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
> ...
> Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>
> $ mthp-swapout -s
> Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> ...
> Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%

Excellent!

>
> Reported-by: Barry Song <21cnbao@gmail.com>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Changes in v3:
> - Using V1 as base.
> - Rename "next" to "list" for the list field, suggested by Ying.
> - Update comment for the locking rules for cluster fields and list,
>   suggested by Ying.
> - Allocate from the nonfull list before attempting free list, suggested
>   by Kairui.

Sorry, I didn't follow the original conversation. But the original intent
of having a per-cpu current cluster was to prevent interleaving pages from
multiple processes and therefore optimize IO. See commit ebc2a1a69111
("swap: make cluster allocation per-cpu"). I wonder if this change could
lead to a swap performance regression in the common order-0 case?

Thanks,
Ryan

> - Link to v2: https://lore.kernel.org/r/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org
>
> Changes in v2:
> - Abandoned.
> - Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>
> ---
> Chris Li (2):
>       mm: swap: swap cluster switch to double link list
>       mm: swap: mTHP allocate swap entries from nonfull list
>
>  include/linux/swap.h |  30 +++----
>  mm/swapfile.c        | 248 +++++++++++++++++---------------------------------
>  2 files changed, 95 insertions(+), 183 deletions(-)
> ---
> base-commit: 19b8422c5bd56fb5e7085995801c6543a98bda1f
> change-id: 20240523-swap-allocator-1534c480ece4
>
> Best regards,
Chris Li <chrisl@kernel.org> writes:

[snip]

> Barry provides a test program to show the effect:
> https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/

[snip]

Unfortunately, the data was obtained using a specially designed test
program which always swaps in pages at the swapped-out size.
I don't know whether such workloads exist in reality. Otherwise, you need
to wait for mTHP swap-in to be merged first, and for people to reach
consensus that we should always swap in pages at the swapped-out size.

Alternatively, we can make some design adjustments so that the patchset
works in the current situation (mTHP swap-out, normal page swap-in):

- One non-full cluster list for each order (same as the current design).

- When one swap entry is freed, check whether an "order+1" swap entry
  becomes free; if so, move the cluster to the "order+1" non-full cluster
  list.

- When allocating a swap entry with "order", get a cluster from the free,
  "order", "order+1", ... non-full cluster lists. If all are empty, fall
  back to order 0.

Do you think that this works?

> Reported-by: Barry Song <21cnbao@gmail.com>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Changes in v3:
> - Using V1 as base.
> - Rename "next" to "list" for the list field, suggested by Ying.
> - Update comment for the locking rules for cluster fields and list,
>   suggested by Ying.
> - Allocate from the nonfull list before attempting free list, suggested
>   by Kairui.

Haven't looked into this. It appears that this breaks the original
discard behavior, which helps the performance of some SSDs; please refer
to commit 2a8f94493432 ("swap: change block allocation algorithm for
SSD"). And as pointed out by Ryan, this may reduce the opportunity for
sequential block device writes during swap-out, which may hurt SSD
performance too.

[snip]

--
Best Regards,
Huang, Ying
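The three-point adjustment proposed in the message above can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the names `Cluster`, `ToyAllocator`, `find_free`, `free_entry`, and `MAX_ORDER` are all hypothetical, and it assumes one reading of the proposal, namely that "order" in the promotion rule means the order of the list the cluster currently sits on, and that a cluster is re-filed under the order it was last allocated from.

```python
from collections import deque

SLOTS = 512     # 4K swap entries per cluster ("512 4K pages" in the thread)
MAX_ORDER = 9   # assumption: one order-9 entry fills a whole cluster


class Cluster:
    def __init__(self, cid):
        self.cid = cid
        self.used = [False] * SLOTS
        self.order = None  # order of the non-full list this cluster belongs to

    def find_free(self, order):
        """Return the first naturally aligned run of 2**order free slots, or None."""
        step = 1 << order
        for start in range(0, SLOTS, step):
            if not any(self.used[start:start + step]):
                return start
        return None

    def set_range(self, start, order, value):
        for i in range(start, start + (1 << order)):
            self.used[i] = value


class ToyAllocator:
    def __init__(self, nr_clusters):
        self.free = deque(Cluster(i) for i in range(nr_clusters))
        self.nonfull = [deque() for _ in range(MAX_ORDER + 1)]

    def alloc(self, order):
        """Search the free list, then nonfull[order], nonfull[order + 1], ..."""
        for lst in [self.free, *self.nonfull[order:]]:
            for cluster in list(lst):
                start = cluster.find_free(order)
                if start is None:
                    continue
                lst.remove(cluster)
                cluster.set_range(start, order, True)
                cluster.order = order
                if not all(cluster.used):
                    self.nonfull[order].append(cluster)
                return cluster, start
        return None  # nothing fits: the caller would fall back to order 0

    def free_entry(self, cluster, start, order):
        cluster.set_range(start, order, False)
        for lst in [self.free, *self.nonfull]:
            if cluster in lst:
                lst.remove(cluster)
        if not any(cluster.used):
            cluster.order = None
            self.free.append(cluster)
            return
        up = cluster.order + 1
        size = 1 << up
        base = start - start % size
        # Promote when the aligned "order+1" block around the freed entry is empty.
        if up <= MAX_ORDER and not any(cluster.used[base:base + size]):
            cluster.order = up
        self.nonfull[cluster.order].append(cluster)
```

In this model, a cluster filled with order-0 entries climbs one list at a time as neighbouring entries are freed, so an order-4 request can eventually be served from it without a completely empty cluster. Whether that behaviour is worth the extra list movement on real workloads is exactly what the thread is debating.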
On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:

[snip]

> Unfortunately, the data was obtained using a specially designed test
> program which always swaps in pages at the swapped-out size. I don't
> know whether such workloads exist in reality. Otherwise, you need to
> wait for mTHP

The test program is designed to simulate mTHP swap behavior using
zsmalloc and a 64KB buffer.
If we insist on only designing for existing workloads, then the zsmalloc
64KB buffer usage will never be able to run, precisely because the
kernel has a high failure rate allocating swap entries for 64KB. There
is a bit of a chicken-and-egg problem there: such a usage cannot exist
because the kernel can't support it yet, and the kernel can't add
patches to support it because such simulation tests are not "real".

We need to break this cycle to support something new.

> swap-in to be merged first, and for people to reach consensus that we
> should always swap in pages at the swapped-out size.

We don't have to always do so. We can identify the situations where it
makes sense. For the zram/zsmalloc 64K buffer usage case, swapping in at
the same size as swap-out makes sense.
I think we have agreement that such zsmalloc 64K usage cases are
something we do want to support.

>
> Alternatively, we can make some design adjustments so that the patchset
> works in the current situation (mTHP swap-out, normal page swap-in):
>
> - One non-full cluster list for each order (same as the current design).
>
> - When one swap entry is freed, check whether an "order+1" swap entry
>   becomes free; if so, move the cluster to the "order+1" non-full cluster
>   list.

In the intended zsmalloc usage case, there is no order+1 swap entry
request.
Moving the cluster to "order+1" will make fewer clusters available for
"order". For that usage case it is a negative gain.

> - When allocating a swap entry with "order", get a cluster from the free,
>   "order", "order+1", ... non-full cluster lists. If all are empty, fall
>   back to

I don't see how it is useful for the zsmalloc 64K buffer usage case.
There will be order 0 and order 4 and nothing else.

How about we keep it simple for now? If we identify some workload that
this algorithm can help, we can do that as a follow-up step.

>   order 0.
>
> Do you think that this works?
>
> > Reported-by: Barry Song <21cnbao@gmail.com>
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > Changes in v3:
> > - Using V1 as base.
> > - Rename "next" to "list" for the list field, suggested by Ying.
> > - Update comment for the locking rules for cluster fields and list,
> >   suggested by Ying.
> > - Allocate from the nonfull list before attempting free list, suggested
> >   by Kairui.
>
> Haven't looked into this. It appears that this breaks the original
> discard behavior, which helps the performance of some SSDs; please refer
> to

Can you clarify whether by "discard" you mean the SSD discard command or
just the way the swap allocator recycles free clusters?

> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD").

I did read that change log. Help me understand in more detail which
discard behavior you have in mind. A lot of low-end micro SD cards
have proper FTL wear leveling now, and SSDs are even better at that.

> And as pointed out by Ryan, this may reduce the opportunity for
> sequential block device writes during swap-out, which may hurt SSD
> performance too.

Only at the initial phase.
If the swap IO continues, after the first pass fills up the swap file,
the writes will be random on the swapfile anyway, because the swapfile
only issues 2M discard commands when all 512 4K pages are free. The
discarded area will be much smaller than the free area on the swapfile.
That, combined with random page writes across the whole swap file, might
produce worse internal write amplification for the SSD compared to only
writing a subset of the swapfile area. I would love to hear from someone
who understands SSD internals to confirm or deny my theory.

Even if we assume the SSD wants a free block over a nonfull cluster
first, zswap and zram swap are not subject to SSD properties. We might
want to have a kernel option to select using nonfree clusters over
the free ones for zram and zswap (ghost swapfile). That will help
contain the fragmented swap area.

Chris
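The discard argument above rests on simple arithmetic: 512 pages of 4K make one 2M cluster, and only a completely free cluster is discarded. The sketch below makes that concrete. It is written in Python purely for illustration; the function names and the example swapfile state are hypothetical, not taken from the kernel.

```python
PAGE_SIZE = 4 * 1024                           # one 4K swap page
PAGES_PER_CLUSTER = 512                        # pages per cluster, as stated in the thread
CLUSTER_SIZE = PAGE_SIZE * PAGES_PER_CLUSTER   # 2 MiB


def free_bytes(free_pages):
    """Total free space, given the number of free pages in each cluster."""
    return sum(n * PAGE_SIZE for n in free_pages)


def discardable_bytes(free_pages):
    """Space eligible for discard when only fully free clusters are discarded."""
    return sum(CLUSTER_SIZE for n in free_pages if n == PAGES_PER_CLUSTER)


if __name__ == "__main__":
    # Hypothetical swapfile state: one empty cluster and three fragmented ones.
    state = [512, 511, 256, 0]
    print(free_bytes(state), discardable_bytes(state))
```

In this made-up state well over half of the free space sits in partially used clusters and is therefore never discarded, which is the gap between "free area" and "discarded area" that the message refers to. How large that gap is in practice depends on the workload.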
Chris Li <chrisl@kernel.org> writes: > On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > This is the short term solutiolns "swap cluster order" listed >> > in my "Swap Abstraction" discussion slice 8 in the recent >> > LSF/MM conference. >> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP >> > orders" is introduced, it only allocates the mTHP swap entries >> > from new empty cluster list. It has a fragmentation issue >> > reported by Barry. >> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ >> > >> > The reason is that all the empty cluster has been exhausted while >> > there are planty of free swap entries to in the cluster that is >> > not 100% free. >> > >> > Remember the swap allocation order in the cluster. >> > Keep track of the per order non full cluster list for later allocation. >> > >> > User impact: For users that allocate and free mix order mTHP swapping, >> > It greatly improves the success rate of the mTHP swap allocation after the >> > initial phase. 
>> > >> > Barry provides a test program to show the effect: >> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ >> > >> > Without: >> > $ mthp-swapout >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54% >> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00% >> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% >> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00% >> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% >> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00% >> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% >> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% >> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00% >> > >> > $ mthp-swapout -s >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65% >> > Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% >> > Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00% >> > Iteration 8: swpout inc: 0, swpout 
fallback inc: 219, Fallback percentage: 100.00% >> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% >> > >> > With: >> > $ mthp-swapout >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% >> > ... >> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> > >> > $ mthp-swapout -s >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, 
Fallback percentage: 0.00% >> > ... >> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> Unfortunately, the data is gotten using a special designed test program >> which always swap-in pages with swapped-out size. I don't know whether >> such workloads exist in reality. Otherwise, you need to wait for mTHP > > The test program is designed to simulate mTHP swap behavior using > zsmalloc and 64KB buffer. > If we insist on only designing for existing workloads, then zsmalloc > using 64KB buffer usage will never be able to run, exactly due the > kernel has high failure rate allocating swap entries for 64KB. There > is a bit of a chick and egg problem there, such a usage can not exist > because the kernel can't support it yet. Kernel can't add patches to > support it because such simulation tests are not "real". > > We need to break this cycle to support something new. > >> swap-in to be merged firstly, and people reach consensus that we should >> always swap-in pages with swapped-out size. > > We don't have to be always. We can identify the situation that makes > sense. For the zram/zsmalloc 64K buffer usage case, swap out as the > same swap in size makes sense. > I think we have agreement on such zsmalloc 64K usage cases we do want > to support. > >> >> Alternately, we can make some design adjustment to make the patchset >> work in current situation (mTHP swap-out, normal page swap-in). 
>> >> - One non-full cluster list for each order (same as current design) >> >> - When one swap entry is freed, check whether one "order+1" swap entry >> becomes free, if so, move the cluster to "order+1" non-full cluster >> list. > > In the intended zsmalloc usage case, there is no order+1 swap entry > request. This my main concern about this series. Only the Android use cases are considered. The general use cases are just ignored. Is it hard to consider or test a normal swap partition on your development machine? > Moving the cluster to "order+1" will make less cluster available for "order". > For that usage case it is negative gain. The "order+1" cluster can be used to allocate "order" cluster when existing "order" cluster is used up. And in this way, we can protect clusters with more free spaces so that they may become free. >> - When allocate swap entry with "order", get cluster from free, "order", >> "order+1", ... non-full cluster list. If all are empty, fallback to > > I don't see that it is useful for the zsmalloc 64K buffer usage case. > There will be order 0 and order 4 and nothing else. > > How about let's keep it simple for now. If we identify some workload > this algorithm can help. We can do that as a follow up step. The simple design isn't flexible enough for your workloads too. For example, - Initially, almost only order-0 pages are swapped out, most non-full clusters are order-0. - Later, quite some order-0 swap entries are freed so that there are quite some order-4 swap entries available. - Order-4 pages need to be swapped out, but no enough order-4 non-full clusters available. So, we need a way to migrate non-full clusters among orders to adjust to the situations automatically. >> order 0. >> >> Do you think that this works? >> >> > Reported-by: Barry Song <21cnbao@gmail.com> >> > Signed-off-by: Chris Li <chrisl@kernel.org> >> > --- >> > Changes in v3: >> > - Using V1 as base. >> > - Rename "next" to "list" for the list field, suggested by Ying. 
>> > - Update comment for the locking rules for cluster fields and list,
>> > suggested by Ying.
>> > - Allocate from the nonfull list before attempting free list, suggested
>> > by Kairui.
>>
>> Haven't looked into this. It appears that this breaks the original
>> discard behavior which helps performance of some SSD, please refer to
>
> Can you clarify by "discard" you mean SSD discard command or just the
> way swap allocator recycles free clusters?

The SSD discard command, as described at the following URL,

https://en.wikipedia.org/wiki/Trim_(computing)

>> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD").
>
> I did read that change log. Help me understand in more detail which
> discard behavior you have in mind. A lot of low end micro SD cards
> have proper FTL wear leveling now, ssd even better on that.

It's not FTL, it's discard/trim for SSD as above.

>> And as pointed out by Ryan, this may reduce the opportunity of the
>> sequential block device writing during swap-out, which may hurt
>> performance of SSD too.
>
> Only at the initial phase. If the swap IO continues, after the first
> pass fills up the swap file, the write will be random on the swapfile
> anyway. Because the swapfile only issues 2M discards commands when all
> 512 4K pages are free. The discarded area will be much smaller than
> the free area on swapfile. That combined with the random write page on
> the whole swap file. It might produce a worse internal write
> amplification for SSD, compared to only writing a subset of the
> swapfile area. I would love to hear from someone who understands SSD
> internals to confirm or deny my theory.

It depends on the workload. Some workloads will have more severe
fragmentation than others. For example, on quite a few machines, the
swap devices will be kept far from full to avoid possible OOM.

> Even let's assume the SSD wants a free block over a nonfull cluster
> first. Zswap and zram swap are not subject to SSD property.
> We might want to have a kernel option to select using nonfree clusters over
> the free one for zram and zswap (ghost swapfile). That will help
> contain the fragmented swap area.

I doubt that it will help fragmentation avoidance much. Please prove
its effectiveness with data first. It can be a further optimization
patch in the series.

Even if we really need it, we can try to do it without a kernel option.
For example, detect whether we are using zram and enable it for zram
automatically (through a general flag).

--
Best Regards,
Huang, Ying
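A minimal sketch of the list handling proposed in the message above, written as a user-space toy model rather than kernel code. It follows the selection order from the message (free list, then the "order", "order+1", ... non-full lists) and re-files a cluster whenever an entry is freed. Several things are assumptions made only for illustration: the names `Cluster` and `ToyAllocator`, the 16-slot cluster size, the per-order "current" cluster, and the choice to file a cluster under the largest order it can still serve, which goes a step further than the single "order+1" move described in the message.

```python
from collections import deque

SLOTS = 16      # toy cluster size (assumption); the thread's clusters hold 512 4K pages
MAX_ORDER = 4   # 2**4 == SLOTS, so a cluster that can serve MAX_ORDER is entirely free


class Cluster:
    def __init__(self, cid):
        self.cid = cid
        self.used = [False] * SLOTS
        self.list = None            # the deque this cluster currently sits on, if any

    def find_free(self, order):
        """Return the start of a free, naturally aligned 2**order block, or None."""
        size = 1 << order
        for start in range(0, SLOTS, size):
            if not any(self.used[start:start + size]):
                return start
        return None

    def best_order(self):
        """Largest order this cluster can still serve, or None when it is full."""
        for order in range(MAX_ORDER, -1, -1):
            if self.find_free(order) is not None:
                return order
        return None


class ToyAllocator:
    def __init__(self, nr_clusters):
        self.free_list = deque(Cluster(i) for i in range(nr_clusters))
        for cluster in self.free_list:
            cluster.list = self.free_list
        self.nonfull = [deque() for _ in range(MAX_ORDER + 1)]
        self.current = {}           # order -> cluster currently being filled

    def _pick(self, order):
        """Selection order from the message: free, then order, order+1, ..."""
        for lst in [self.free_list] + self.nonfull[order:]:
            if lst:
                cluster = lst.popleft()
                cluster.list = None
                return cluster
        return None

    def _file(self, cluster):
        """File a cluster on the list matching its free space (none if full)."""
        best = cluster.best_order()
        if best is None:
            cluster.list = None
        else:
            cluster.list = self.free_list if best == MAX_ORDER else self.nonfull[best]
            cluster.list.append(cluster)

    def alloc(self, order):
        cluster = self.current.get(order)
        if cluster is None or cluster.find_free(order) is None:
            if cluster is not None:
                del self.current[order]
                self._file(cluster)             # retire the exhausted cluster
            cluster = self._pick(order)
            if cluster is None:
                return None                     # caller falls back to order 0
            self.current[order] = cluster
        start = cluster.find_free(order)
        for i in range(start, start + (1 << order)):
            cluster.used[i] = True
        return cluster, start

    def free(self, cluster, start, order):
        for i in range(start, start + (1 << order)):
            cluster.used[i] = False
        if any(c is cluster for c in self.current.values()):
            return                              # still being filled, leave it alone
        if cluster.list is not None:
            cluster.list.remove(cluster)
        self._file(cluster)                     # may migrate to a higher order list
```

In this toy, filling every cluster with order-0 entries and then freeing four adjacent, aligned entries moves that cluster onto the order-2 list, so a later order-2 request succeeds without a free cluster. Whether this migration pays off for real workloads is exactly the question debated in the thread; the sketch only illustrates the mechanics.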
Chris Li <chrisl@kernel.org> writes:

> This is the short term solutiolns "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" is introduced, it only allocates the mTHP swap entries
> from new empty cluster list. It has a fragmentation issue
> reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The reason is that all the empty cluster has been exhausted while
> there are planty of free swap entries to in the cluster that is
> not 100% free.
>
> Remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.

"Non full" is a kind of negative naming. Can we use "partial", as
used in SLUB?

[snip]

--
Best Regards,
Huang, Ying
On Mon, Jun 24, 2024 at 7:36 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > >> > This is the short term solutiolns "swap cluster order" listed > >> > in my "Swap Abstraction" discussion slice 8 in the recent > >> > LSF/MM conference. > >> > > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP > >> > orders" is introduced, it only allocates the mTHP swap entries > >> > from new empty cluster list. It has a fragmentation issue > >> > reported by Barry. > >> > > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ > >> > > >> > The reason is that all the empty cluster has been exhausted while > >> > there are planty of free swap entries to in the cluster that is > >> > not 100% free. > >> > > >> > Remember the swap allocation order in the cluster. > >> > Keep track of the per order non full cluster list for later allocation. > >> > > >> > User impact: For users that allocate and free mix order mTHP swapping, > >> > It greatly improves the success rate of the mTHP swap allocation after the > >> > initial phase. 
> >> > > >> > Barry provides a test program to show the effect: > >> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ > >> > > >> > Without: > >> > $ mthp-swapout > >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54% > >> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00% > >> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% > >> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00% > >> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% > >> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00% > >> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% > >> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% > >> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00% > >> > > >> > $ mthp-swapout -s > >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65% > >> > Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% > >> > Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback 
percentage: 100.00% > >> > Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00% > >> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% > >> > > >> > With: > >> > $ mthp-swapout > >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > ... > >> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > > >> > $ mthp-swapout -s > >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 7: swpout inc: 
223, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > ... > >> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00% > >> > >> Unfortunately, the data is gotten using a special designed test program > >> which always swap-in pages with swapped-out size. I don't know whether > >> such workloads exist in reality. Otherwise, you need to wait for mTHP > > > > The test program is designed to simulate mTHP swap behavior using > > zsmalloc and 64KB buffer. > > If we insist on only designing for existing workloads, then zsmalloc > > using 64KB buffer usage will never be able to run, exactly due the > > kernel has high failure rate allocating swap entries for 64KB. There > > is a bit of a chick and egg problem there, such a usage can not exist > > because the kernel can't support it yet. Kernel can't add patches to > > support it because such simulation tests are not "real". > > > > We need to break this cycle to support something new. > > > >> swap-in to be merged firstly, and people reach consensus that we should > >> always swap-in pages with swapped-out size. > > > > We don't have to be always. We can identify the situation that makes > > sense. For the zram/zsmalloc 64K buffer usage case, swap out as the > > same swap in size makes sense. 
> > I think we have agreement on such zsmalloc 64K usage cases we do want > > to support. > > > >> > >> Alternately, we can make some design adjustment to make the patchset > >> work in current situation (mTHP swap-out, normal page swap-in). > >> > >> - One non-full cluster list for each order (same as current design) > >> > >> - When one swap entry is freed, check whether one "order+1" swap entry > >> becomes free, if so, move the cluster to "order+1" non-full cluster > >> list. > > > > In the intended zsmalloc usage case, there is no order+1 swap entry > > request. > > This my main concern about this series. Only the Android use cases are > considered. The general use cases are just ignored. Is it hard to > consider or test a normal swap partition on your development machine? Please see the V4 cover letter. The V4 already has the SSD, zram and HDD stress testing. Of course I want to make sure the allocator works well with Barry's mthp test case as well. > > Moving the cluster to "order+1" will make less cluster available for "order". > > For that usage case it is negative gain. > > The "order+1" cluster can be used to allocate "order" cluster when > existing "order" cluster is used up. > > And in this way, we can protect clusters with more free spaces so that > they may become free. > > >> - When allocate swap entry with "order", get cluster from free, "order", > >> "order+1", ... non-full cluster list. If all are empty, fallback to > > > > I don't see that it is useful for the zsmalloc 64K buffer usage case. > > There will be order 0 and order 4 and nothing else. > > > > How about let's keep it simple for now. If we identify some workload > > this algorithm can help. We can do that as a follow up step. > > The simple design isn't flexible enough for your workloads too. For > example, > > - Initially, almost only order-0 pages are swapped out, most non-full > clusters are order-0. 
>
> - Later, quite some order-0 swap entries are freed so that there are
> quite some order-4 swap entries available.
>
> - Order-4 pages need to be swapped out, but no enough order-4 non-full
> clusters available.
>
> So, we need a way to migrate non-full clusters among orders to adjust to
> the situations automatically.

It depends on how lucky we are in forming an order-4 cluster naturally. The
odds of forming an order-4 cluster naturally in the random swap
allocation/free case are very low. I have the numbers in my other email
thread.
Anyway, if we are convinced that this pays off for the complexity it
introduces, we can do that as a follow-up step. Let's try to keep things
simple at first for the benefit of review.

>
> >> order 0.
> >>
> >> Do you think that this works?
> >>
> >> > Reported-by: Barry Song <21cnbao@gmail.com>
> >> > Signed-off-by: Chris Li <chrisl@kernel.org>
> >> > ---
> >> > Changes in v3:
> >> > - Using V1 as base.
> >> > - Rename "next" to "list" for the list field, suggested by Ying.
> >> > - Update comment for the locking rules for cluster fields and list,
> >> > suggested by Ying.
> >> > - Allocate from the nonfull list before attempting free list, suggested
> >> > by Kairui.
> >>
> >> Haven't looked into this. It appears that this breaks the original
> >> discard behavior which helps performance of some SSD, please refer to
> >
> > Can you clarify by "discard" you mean SSD discard command or just the
> > way swap allocator recycles free clusters?
>
> The SSD discard command, like in the following URL,
>
> https://en.wikipedia.org/wiki/Trim_(computing)

Thanks. I know what an SSD discard command is. I want to understand why
that behavior is preferred.

So the reasoning for preferring a new free block over a recently
partially freed cluster is to give the previously written cluster a
higher chance of issuing the discard command?

This prefer-new-block behavior is actually not friendly to SSDs from
a wear point of view.
Take this example:
Let's say data needs to be allocated and freed from swap. At any given
time the swap usage is 1G. The swap SSD drive is 16G.
Let's say the allocations and frees are at random 4K page locations. In
total 64G of swap data needs to be written to swap, but at any given
time only 1G of data is occupied on the swapfile.

a) If you always prefer new free blocks, the swap data will
eventually be written across the whole 16G drive, followed by random
writes to the full 16G. The chance of forming a free cluster so that a
discard command can be issued is very low: (15/16)**512 = 4.4E-15. From
the SSD's point of view, it does not know that most of the data written
to the 16G drive is not used. When a page is freed on a swapfile, the SSD
doesn't know about it. It sees 4K random writes to all 16G of the drive,
64G of data written in total.

b) If you always prefer a non-full cluster over a new cluster, the
64G of data will be concentrated as random writes to the first 1G of the
drive. 64G of data written in total.

I consider b) more friendly to the SSD than a), because it concentrates
the writes into the first 1G. The SSD can know that the data
overwritten in that 1G is internally obsolete, so it can internally
GC the overwritten data without a discard command. Whereas a) does
random 4K writes to the whole drive without much discard at all. A full
SSD doing random writes is a bad combination from a wear point of
view.

Just my 2 cents. Anyway, in V4 I reverted to using free clusters before
nonfull clusters, just to behave the same as before.

> >> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD").
> >
> > I did read that change log. Help me understand in more detail which
> > discard behavior you have in mind. A lot of low end micro SD cards
> > have proper FTL wear leveling now, ssd even better on that.
>
> It's not FTL, it's discard/trim for SSD as above.

Thanks for the clarification.
>
> >> And as pointed out by Ryan, this may reduce the opportunity of the
> >> sequential block device writing during swap-out, which may hurt
> >> performance of SSD too.
> >
> > Only at the initial phase. If the swap IO continues, after the first
> > pass fills up the swap file, the write will be random on the swapfile
> > anyway. Because the swapfile only issues 2M discards commands when all
> > 512 4K pages are free. The discarded area will be much smaller than
> > the free area on swapfile. That combined with the random write page on
> > the whole swap file. It might produce a worse internal write
> > amplification for SSD, compared to only writing a subset of the
> > swapfile area. I would love to hear from someone who understands SSD
> > internals to confirm or deny my theory.
>
> It depends on workloads. Some workloads will have more severe
> fragmentation than others. For example, on quite some machines, the
> swap devices will be far from being full to avoid possible OOM.

I suspect most SSD swap on client devices nowadays serves only as a
backup, just in case something needs to be swapped.
There is not much SSD swap IO during normal use. zram and zswap
are more actively used in the data center and Android phone cases, from
a swap IO ops point of view.

>
> > Even let's assume the SSD wants a free block over a nonfull cluster
> > first. Zswap and zram swap are not subject to SSD property. We might
> > want to have a kernel option to select using nonfree clusters over
> > the free one for zram and zswap (ghost swapfile). That will help
> > contain the fragmented swap area.
>
> I suspect that it will help fragmentation avoidance much. Please prove
> its effectiveness with data firstly. It can be a further optimization
> patch in the series.

Take the above example of 1GB of data written to a 16GB drive. a) will
fragment the whole 16GB drive.
b) will concentrate on the first 1GB location that was used.
>
> Even if we really need it, we can try to do it without a kernel option.
> For example, detect whether we are using zram and enable it for zram
> automatically (through a general flag).

For zswap you need an option to choose, because it can write
to the real swapfile as well.
Do you optimize the swap allocator for zswap or for the physical swapfile?

Chris
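The odds quoted in the 1G-in-use, 16G-drive example of the message above can be checked with a short back-of-the-envelope script. This is only a sketch under the same simplifying assumption as the example, namely that each 4K slot is independently in use with probability 1/16; the variable names are made up for illustration.

```python
# Rough check of the "1G in use on a 16G swap drive" example.
DRIVE_GB = 16
IN_USE_GB = 1
PAGES_PER_CLUSTER = 512           # one 2M cluster holds 512 4K pages

# Assumption: slots are occupied independently and uniformly at random.
p_page_free = 1 - IN_USE_GB / DRIVE_GB              # 15/16
p_cluster_free = p_page_free ** PAGES_PER_CLUSTER   # all 512 pages free at once

clusters = DRIVE_GB * 1024 // 2                     # number of 2M clusters on the drive
expected_free_clusters = clusters * p_cluster_free

print(f"P(page free)            = {p_page_free}")
print(f"P(cluster fully free)   = {p_cluster_free:.2e}")
print(f"clusters on the drive   = {clusters}")
print(f"expected free clusters  = {expected_free_clusters:.2e}")
```

Under this model the chance that a given cluster is entirely free is on the order of 1e-15, so across the whole drive the expected number of discardable clusters is effectively zero. Real workloads free entries in correlated bursts rather than uniformly at random, so the figure is an illustration of the argument, not a measurement.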
Chris Li <chrisl@kernel.org> writes: > On Mon, Jun 24, 2024 at 7:36 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> > This is the short term solutiolns "swap cluster order" listed >> >> > in my "Swap Abstraction" discussion slice 8 in the recent >> >> > LSF/MM conference. >> >> > >> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP >> >> > orders" is introduced, it only allocates the mTHP swap entries >> >> > from new empty cluster list. It has a fragmentation issue >> >> > reported by Barry. >> >> > >> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ >> >> > >> >> > The reason is that all the empty cluster has been exhausted while >> >> > there are planty of free swap entries to in the cluster that is >> >> > not 100% free. >> >> > >> >> > Remember the swap allocation order in the cluster. >> >> > Keep track of the per order non full cluster list for later allocation. >> >> > >> >> > User impact: For users that allocate and free mix order mTHP swapping, >> >> > It greatly improves the success rate of the mTHP swap allocation after the >> >> > initial phase. 
>> >> > >> >> > Barry provides a test program to show the effect: >> >> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ >> >> > >> >> > Without: >> >> > $ mthp-swapout >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54% >> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00% >> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% >> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00% >> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% >> >> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00% >> >> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% >> >> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% >> >> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00% >> >> > >> >> > $ mthp-swapout -s >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65% >> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% >> >> > Iteration 7: swpout inc: 0, swpout 
fallback inc: 223, Fallback percentage: 100.00% >> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00% >> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% >> >> > >> >> > With: >> >> > $ mthp-swapout >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > ... >> >> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > >> >> > $ mthp-swapout -s >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, 
Fallback percentage: 0.00% >> >> > Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > ... >> >> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> >> Unfortunately, the data is gotten using a special designed test program >> >> which always swap-in pages with swapped-out size. I don't know whether >> >> such workloads exist in reality. Otherwise, you need to wait for mTHP >> > >> > The test program is designed to simulate mTHP swap behavior using >> > zsmalloc and 64KB buffer. >> > If we insist on only designing for existing workloads, then zsmalloc >> > using 64KB buffer usage will never be able to run, exactly due the >> > kernel has high failure rate allocating swap entries for 64KB. There >> > is a bit of a chick and egg problem there, such a usage can not exist >> > because the kernel can't support it yet. Kernel can't add patches to >> > support it because such simulation tests are not "real". >> > >> > We need to break this cycle to support something new. >> > >> >> swap-in to be merged firstly, and people reach consensus that we should >> >> always swap-in pages with swapped-out size. >> > >> > We don't have to be always. We can identify the situation that makes >> > sense. 
For the zram/zsmalloc 64K buffer usage case, swap out as the >> > same swap in size makes sense. >> > I think we have agreement on such zsmalloc 64K usage cases we do want >> > to support. >> > >> >> >> >> Alternately, we can make some design adjustment to make the patchset >> >> work in current situation (mTHP swap-out, normal page swap-in). >> >> >> >> - One non-full cluster list for each order (same as current design) >> >> >> >> - When one swap entry is freed, check whether one "order+1" swap entry >> >> becomes free, if so, move the cluster to "order+1" non-full cluster >> >> list. >> > >> > In the intended zsmalloc usage case, there is no order+1 swap entry >> > request. >> >> This my main concern about this series. Only the Android use cases are >> considered. The general use cases are just ignored. Is it hard to >> consider or test a normal swap partition on your development machine? > > Please see the V4 cover letter. The V4 already has the SSD, zram and > HDD stress testing. > Of course I want to make sure the allocator works well with Barry's > mthp test case as well. > >> > Moving the cluster to "order+1" will make less cluster available for "order". >> > For that usage case it is negative gain. >> >> The "order+1" cluster can be used to allocate "order" cluster when >> existing "order" cluster is used up. >> >> And in this way, we can protect clusters with more free spaces so that >> they may become free. >> >> >> - When allocate swap entry with "order", get cluster from free, "order", >> >> "order+1", ... non-full cluster list. If all are empty, fallback to >> > >> > I don't see that it is useful for the zsmalloc 64K buffer usage case. >> > There will be order 0 and order 4 and nothing else. >> > >> > How about let's keep it simple for now. If we identify some workload >> > this algorithm can help. We can do that as a follow up step. >> >> The simple design isn't flexible enough for your workloads too. 
For >> example, >> >> - Initially, almost only order-0 pages are swapped out, most non-full >> clusters are order-0. >> >> - Later, quite some order-0 swap entries are freed so that there are >> quite some order-4 swap entries available. >> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full >> clusters available. >> >> So, we need a way to migrate non-full clusters among orders to adjust to >> the situations automatically. > > Depends on how lucky it is to form the order-4 cluster naturally. The > odds of forming the order-4 cluster naturally in random swap > allocation/ free case is very low. I have the number in my other email > thread. > Anyway, if we convince this payout for the complexity it introduces, > we can do that as follow up steps. Try to keep things simple at first > for the review benefit. > >> >> >> order 0. >> >> >> >> Do you think that this works? >> >> >> >> > Reported-by: Barry Song <21cnbao@gmail.com> >> >> > Signed-off-by: Chris Li <chrisl@kernel.org> >> >> > --- >> >> > Changes in v3: >> >> > - Using V1 as base. >> >> > - Rename "next" to "list" for the list field, suggested by Ying. >> >> > - Update comment for the locking rules for cluster fields and list, >> >> > suggested by Ying. >> >> > - Allocate from the nonfull list before attempting free list, suggested >> >> > by Kairui. >> >> >> >> Haven't looked into this. It appears that this breaks the original >> >> discard behavior which helps performance of some SSD, please refer to >> > >> > Can you clarify by "discard" you mean SSD discard command or just the >> > way swap allocator recycles free clusters? >> >> The SSD discard command, like in the following URL, >> >> https://en.wikipedia.org/wiki/Trim_(computing) > > Thanks. I know what an SSD discard command is. Want to understand why > that behavior is preferred. 
> > So the reasoning to prefer a new free block rather than a recent > particle free cluster is to let the previous written cluster have a > higher chance to issue the discard command? > > This preferred new block behavior is actually not friendly to SSD from > a wearing point of view. > Take this example: > Let say the data need to allocate and free from swap. At any given > time the swap usage is 1G. The swap SSD drive is 16G. > Let say the allocation and free are at random 4K page locations. There > is totally 64G swap data needed to write to swap, but at any given > time there is only 1G data occupite on swapfile. > > a) If you always prefer new free blocks. Then the swap data will > eventually write at all 16G drives then random write to full 16G. > Chance of forming a free cluster so a discard command can be issued is > very low. (15/16)**512 = 4.4E-15. From SSD point of view, it does not > know most of the data written to 16G drive is not used. When a page is > free on a swapfile, SSD drive doesn't know about it. It sees 4K random > writes to all 16G of the drive, total 64G data written. > > b) If you always prefer a non full cluster first over a new cluster. > The 64G data will concentrate random writing to the first 1G of drive > location. Total 64G data written. > > I consider b) are more friendly to SSD than a). Because concentrate > the write into the first 1G location. The SSD can know the data > overwritten in those 1G has internally obsolete, so it can internally > GC the those overwritten data without a discard command. Where a) > random 4K writes to the whole drive without much discard at all. Full > SSD doing random writes is a bad combination from a wearing point of > view. > > Just my 2 cents. Anyway I revert the V4 to use free cluster before > nonfull cluster just to behave the same as previously. > >> >> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD"). >> > >> > I did read that change log. 
>> > Help me understand in more detail which
>> > discard behavior you have in mind. A lot of low end micro SD cards
>> > have proper FTL wear leveling now, ssd even better on that.
>>
>> It's not FTL, it's discard/trim for SSD as above.
>
> Thanks for the clarification.
>
>>
>> >> And as pointed out by Ryan, this may reduce the opportunity of the
>> >> sequential block device writing during swap-out, which may hurt
>> >> performance of SSD too.
>> >
>> > Only at the initial phase. If the swap IO continues, after the first
>> > pass fills up the swap file, the write will be random on the swapfile
>> > anyway. Because the swapfile only issues 2M discards commands when all
>> > 512 4K pages are free. The discarded area will be much smaller than
>> > the free area on swapfile. That combined with the random write page on
>> > the whole swap file. It might produce a worse internal write
>> > amplification for SSD, compared to only writing a subset of the
>> > swapfile area. I would love to hear from someone who understands SSD
>> > internals to confirm or deny my theory.
>>
>> It depends on workloads. Some workloads will have more severe
>> fragmentation than others. For example, on quite some machines, the
>> swap devices will be far from being full to avoid possible OOM.
>
> I suspect most of the SSD swap on client devices nowadays are only as
> backup just in case it needs to be swapped.
> There is not much SSD swap IO during normal use. The zram and zswap
> are more actively used in the data center and Android phone case, from
> swap IO ops point of view.

I use a Linux laptop with 16GB DRAM for work, and I have found that its
swap space is almost always in use.

>>
>> > Even let's assume the SSD wants a free block over a nonfull cluster
>> > first. Zswap and zram swap are not subject to SSD property. We might
>> > want to have a kernel option to select using nonfree clusters over
>> > the free one for zram and zswap (ghost swapfile).
That will help >> > contain the fragmented swap area. >> >> I suspect that it will help fragmentation avoidance much. Please prove >> its effectiveness with data firstly. It can be a further optimization >> patch in the series. > > Take the above 1GB data written in a 16GB drive example. a) will > fragment the whole 16GB drive. > b) will concentrate on the first 1GB location that was used. > >> >> Even if we really need it, we can try to do it without a kernel option. >> For example, detect whether we are using zram and enable it for zram >> automatically (through a general flag). > > zswap you need to have an option to choose from because it can write > to the real swappfile as well. > Do you optimize the swap allocator for the zswap or physical swapfile. -- Best Regards, Huang, Ying
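
The a) versus b) argument quoted above can be explored with a toy model. The sketch below is only an illustration under loud assumptions: it is not kernel code, the cluster count, the 32-slot cluster size and every name in it (`p_cluster_free`, `simulate`, `prefer_free_first`) are hypothetical, and slots are treated as independent when reproducing the `(15/16)**512` estimate. It counts how many distinct clusters get written when a fixed amount of live data is freed and reallocated at random.

```python
import random
from collections import deque


def p_cluster_free(slot_used_prob=1 / 16, slots=512):
    """Chance that every slot of one cluster is free (assumes independent slots)."""
    return (1 - slot_used_prob) ** slots


def simulate(prefer_free_first, clusters=64, slots=32, live_target=128,
             writes=8192, seed=0):
    """Return how many distinct clusters a toy swap device ever writes to."""
    rng = random.Random(seed)
    free_slots = [set(range(slots)) for _ in range(clusters)]
    free_list = deque(range(clusters))  # fully free clusters, FIFO order
    live = []                           # (cluster, slot) entries currently in use
    touched = set()
    current = None                      # cluster currently being filled

    def pick_nonfull():
        # Lowest-numbered partially used cluster, or None if there is none.
        for c in range(clusters):
            if 0 < len(free_slots[c]) < slots:
                return c
        return None

    for _ in range(writes):
        if len(live) == live_target:
            # Keep usage constant: free one random live entry first.
            i = rng.randrange(len(live))
            live[i], live[-1] = live[-1], live[i]
            c, s = live.pop()
            free_slots[c].add(s)
            if len(free_slots[c]) == slots and c != current:
                free_list.append(c)
        if current is None or not free_slots[current]:
            if prefer_free_first:   # policy a): fully free cluster first
                current = free_list.popleft() if free_list else pick_nonfull()
            else:                   # policy b): non-full cluster first
                current = pick_nonfull()
                if current is None:
                    current = free_list.popleft()
        slot = min(free_slots[current])
        free_slots[current].remove(slot)
        live.append((current, slot))
        touched.add(current)
    return len(touched)


if __name__ == "__main__":
    print("P(cluster fully free):", p_cluster_free())
    print("clusters written, policy a):", simulate(True))
    print("clusters written, policy b):", simulate(False))
```

In this toy model, policy b) never leaves the clusters needed to hold the live data, while policy a) keeps moving on to never-used clusters. Whether that translates into better or worse SSD wear is exactly the open question in the thread; the model says nothing about FTL behavior or discard.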
On Fri, Jul 26, 2024 at 12:01 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Mon, Jun 24, 2024 at 7:36 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Chris Li <chrisl@kernel.org> writes:
> >> >>
> >> >> > This is the short term solution "swap cluster order" listed
> >> >> > in my "Swap Abstraction" discussion slide 8 in the recent
> >> >> > LSF/MM conference.
> >> >> >
> >> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> >> > orders" is introduced, it only allocates the mTHP swap entries
> >> >> > from new empty cluster list.  It has a fragmentation issue
> >> >> > reported by Barry.
> >> >> >
> >> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >> >> >
> >> >> > The reason is that all the empty clusters have been exhausted while
> >> >> > there are plenty of free swap entries in the clusters that are
> >> >> > not 100% free.
> >> >> >
> >> >> > Remember the swap allocation order in the cluster.
> >> >> > Keep track of the per order non full cluster list for later allocation.
> >> >> >
> >> >> > User impact: For users that allocate and free mixed-order mTHP swapping,
> >> >> > it greatly improves the success rate of the mTHP swap allocation after the
> >> >> > initial phase.
> >> >> > > >> >> > Barry provides a test program to show the effect: > >> >> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ > >> >> > > >> >> > Without: > >> >> > $ mthp-swapout > >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54% > >> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00% > >> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% > >> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00% > >> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% > >> >> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00% > >> >> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% > >> >> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% > >> >> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00% > >> >> > > >> >> > $ mthp-swapout -s > >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65% > >> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 
100.00% > >> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00% > >> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00% > >> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% > >> >> > > >> >> > With: > >> >> > $ mthp-swapout > >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > ... > >> >> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > > >> >> > $ mthp-swapout -s > >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 5: swpout inc: 230, swpout 
fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > ... > >> >> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00% > >> >> > >> >> Unfortunately, the data is gotten using a special designed test program > >> >> which always swap-in pages with swapped-out size. I don't know whether > >> >> such workloads exist in reality. Otherwise, you need to wait for mTHP > >> > > >> > The test program is designed to simulate mTHP swap behavior using > >> > zsmalloc and 64KB buffer. > >> > If we insist on only designing for existing workloads, then zsmalloc > >> > using 64KB buffer usage will never be able to run, exactly due the > >> > kernel has high failure rate allocating swap entries for 64KB. There > >> > is a bit of a chick and egg problem there, such a usage can not exist > >> > because the kernel can't support it yet. Kernel can't add patches to > >> > support it because such simulation tests are not "real". > >> > > >> > We need to break this cycle to support something new. > >> > > >> >> swap-in to be merged firstly, and people reach consensus that we should > >> >> always swap-in pages with swapped-out size. 
> >> > > >> > We don't have to be always. We can identify the situation that makes > >> > sense. For the zram/zsmalloc 64K buffer usage case, swap out as the > >> > same swap in size makes sense. > >> > I think we have agreement on such zsmalloc 64K usage cases we do want > >> > to support. > >> > > >> >> > >> >> Alternately, we can make some design adjustment to make the patchset > >> >> work in current situation (mTHP swap-out, normal page swap-in). > >> >> > >> >> - One non-full cluster list for each order (same as current design) > >> >> > >> >> - When one swap entry is freed, check whether one "order+1" swap entry > >> >> becomes free, if so, move the cluster to "order+1" non-full cluster > >> >> list. > >> > > >> > In the intended zsmalloc usage case, there is no order+1 swap entry > >> > request. > >> > >> This my main concern about this series. Only the Android use cases are > >> considered. The general use cases are just ignored. Is it hard to > >> consider or test a normal swap partition on your development machine? > > > > Please see the V4 cover letter. The V4 already has the SSD, zram and > > HDD stress testing. > > Of course I want to make sure the allocator works well with Barry's > > mthp test case as well. > > > >> > Moving the cluster to "order+1" will make less cluster available for "order". > >> > For that usage case it is negative gain. > >> > >> The "order+1" cluster can be used to allocate "order" cluster when > >> existing "order" cluster is used up. > >> > >> And in this way, we can protect clusters with more free spaces so that > >> they may become free. > >> > >> >> - When allocate swap entry with "order", get cluster from free, "order", > >> >> "order+1", ... non-full cluster list. If all are empty, fallback to > >> > > >> > I don't see that it is useful for the zsmalloc 64K buffer usage case. > >> > There will be order 0 and order 4 and nothing else. > >> > > >> > How about let's keep it simple for now. 
If we identify some workload > >> > this algorithm can help. We can do that as a follow up step. > >> > >> The simple design isn't flexible enough for your workloads too. For > >> example, > >> > >> - Initially, almost only order-0 pages are swapped out, most non-full > >> clusters are order-0. > >> > >> - Later, quite some order-0 swap entries are freed so that there are > >> quite some order-4 swap entries available. > >> > >> - Order-4 pages need to be swapped out, but no enough order-4 non-full > >> clusters available. > >> > >> So, we need a way to migrate non-full clusters among orders to adjust to > >> the situations automatically. > > > > Depends on how lucky it is to form the order-4 cluster naturally. The > > odds of forming the order-4 cluster naturally in random swap > > allocation/ free case is very low. I have the number in my other email > > thread. > > Anyway, if we convince this payout for the complexity it introduces, > > we can do that as follow up steps. Try to keep things simple at first > > for the review benefit. > > > >> > >> >> order 0. > >> >> > >> >> Do you think that this works? > >> >> > >> >> > Reported-by: Barry Song <21cnbao@gmail.com> > >> >> > Signed-off-by: Chris Li <chrisl@kernel.org> > >> >> > --- > >> >> > Changes in v3: > >> >> > - Using V1 as base. > >> >> > - Rename "next" to "list" for the list field, suggested by Ying. > >> >> > - Update comment for the locking rules for cluster fields and list, > >> >> > suggested by Ying. > >> >> > - Allocate from the nonfull list before attempting free list, suggested > >> >> > by Kairui. > >> >> > >> >> Haven't looked into this. It appears that this breaks the original > >> >> discard behavior which helps performance of some SSD, please refer to > >> > > >> > Can you clarify by "discard" you mean SSD discard command or just the > >> > way swap allocator recycles free clusters? 
> >> > >> The SSD discard command, like in the following URL, > >> > >> https://en.wikipedia.org/wiki/Trim_(computing) > > > > Thanks. I know what an SSD discard command is. Want to understand why > > that behavior is preferred. > > > > So the reasoning to prefer a new free block rather than a recent > > particle free cluster is to let the previous written cluster have a > > higher chance to issue the discard command? > > > > This preferred new block behavior is actually not friendly to SSD from > > a wearing point of view. > > Take this example: > > Let say the data need to allocate and free from swap. At any given > > time the swap usage is 1G. The swap SSD drive is 16G. > > Let say the allocation and free are at random 4K page locations. There > > is totally 64G swap data needed to write to swap, but at any given > > time there is only 1G data occupite on swapfile. > > > > a) If you always prefer new free blocks. Then the swap data will > > eventually write at all 16G drives then random write to full 16G. > > Chance of forming a free cluster so a discard command can be issued is > > very low. (15/16)**512 = 4.4E-15. From SSD point of view, it does not > > know most of the data written to 16G drive is not used. When a page is > > free on a swapfile, SSD drive doesn't know about it. It sees 4K random > > writes to all 16G of the drive, total 64G data written. > > > > b) If you always prefer a non full cluster first over a new cluster. > > The 64G data will concentrate random writing to the first 1G of drive > > location. Total 64G data written. > > > > I consider b) are more friendly to SSD than a). Because concentrate > > the write into the first 1G location. The SSD can know the data > > overwritten in those 1G has internally obsolete, so it can internally > > GC the those overwritten data without a discard command. Where a) > > random 4K writes to the whole drive without much discard at all. 
Full > > SSD doing random writes is a bad combination from a wearing point of > > view. > > > > Just my 2 cents. Anyway I revert the V4 to use free cluster before > > nonfull cluster just to behave the same as previously. > > > >> >> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD"). > >> > > >> > I did read that change log. Help me understand in more detail which > >> > discard behavior you have in mind. A lot of low end micro SD cards > >> > have proper FTL wear leveling now, ssd even better on that. > >> > >> It's not FTL, it's discard/trim for SSD as above. > > > > Thanks for the clarification. > > > >> > >> >> And as pointed out by Ryan, this may reduce the opportunity of the > >> >> sequential block device writing during swap-out, which may hurt > >> >> performance of SSD too. > >> > > >> > Only at the initial phase. If the swap IO continues, after the first > >> > pass fills up the swap file, the write will be random on the swapfile > >> > anyway. Because the swapfile only issues 2M discards commands when all > >> > 512 4K pages are free. The discarded area will be much smaller than > >> > the free area on swapfile. That combined with the random write page on > >> > the whole swap file. It might produce a worse internal write > >> > amplification for SSD, compared to only writing a subset of the > >> > swapfile area. I would love to hear from someone who understands SSD > >> > internals to confirm or deny my theory. > >> > >> It depends on workloads. Some workloads will have more severe > >> fragmentation than others. For example, on quite some machines, the > >> swap devices will be far from being full to avoid possible OOM. > > > > I suspect most of the SSD swap on client devices nowadays are only as > > backup just in case it needs to be swapped. > > There is not much SSD swap IO during normal use. The zram and zswap > > are more actively used in the data center and Android phone case, from > > swap IO ops point of view. 
>
> I use a Linux laptop with 16GB DRAM for work.  And I found that the swap
> space is almost always used.

Just curious how many swap OPS per second on average? I suspect it
will be a very low number.

Chris

>
> >>
> >> > Even let's assume the SSD wants a free block over a nonfull cluster
> >> > first. Zswap and zram swap are not subject to SSD property. We might
> >> > want to have a kernel option to select using nonfree clusters over
> >> > the free one for zram and zswap (ghost swapfile). That will help
> >> > contain the fragmented swap area.
> >>
> >> I suspect that it will help fragmentation avoidance much.  Please prove
> >> its effectiveness with data firstly.  It can be a further optimization
> >> patch in the series.
> >
> > Take the above 1GB data written in a 16GB drive example. a) will
> > fragment the whole 16GB drive.
> > b) will concentrate on the first 1GB location that was used.
> >
> >>
> >> Even if we really need it, we can try to do it without a kernel option.
> >> For example, detect whether we are using zram and enable it for zram
> >> automatically (through a general flag).
> >
> > zswap you need to have an option to choose from because it can write
> > to the real swapfile as well.
> > Do you optimize the swap allocator for the zswap or physical swapfile.
>
> --
> Best Regards,
> Huang, Ying
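
The cluster lookup order that Ying proposes in the quoted discussion (free list first, then the non-full lists for "order", "order+1", and so on, with a fallback to order 0 if all are empty) can be sketched as below. This is only a reading of that proposal, not the code in the patch series; `pick_cluster`, `free_list` and `nonfull` are hypothetical names, and the migration of clusters between orders on free is deliberately left out.

```python
from collections import deque


def pick_cluster(free_list, nonfull, order):
    """Return a cluster id for an allocation of `order`, or None.

    free_list: deque of fully free cluster ids.
    nonfull:   list indexed by order; nonfull[o] is a deque of cluster ids
               that are partially used and tracked for order o.
    """
    if free_list:
        return free_list.popleft()
    # No free cluster left: try the non-full list of this order, then of
    # each higher order in turn.
    for o in range(order, len(nonfull)):
        if nonfull[o]:
            return nonfull[o][0]
    # Nothing suitable; the caller would fall back to an order-0 allocation.
    return None


if __name__ == "__main__":
    nonfull = [deque([7]), deque(), deque(), deque(), deque([3]), deque([9])]
    print(pick_cluster(deque(), nonfull, 4))     # order-4 non-full list is used
    print(pick_cluster(deque([1]), nonfull, 4))  # a free cluster wins
```

Chris's objection in the thread is that for the zsmalloc 64K case only order 0 and order 4 requests occur, so the "order+1" steps of the loop would rarely find anything.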
Chris Li <chrisl@kernel.org> writes:

> On Fri, Jul 26, 2024 at 12:01 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Mon, Jun 24, 2024 at 7:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Chris Li <chrisl@kernel.org> writes:
>> >> >>
>> >> >> > This is the short term solution "swap cluster order" listed
>> >> >> > in my "Swap Abstraction" discussion slide 8 in the recent
>> >> >> > LSF/MM conference.
>> >> >> >
>> >> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> >> >> > orders" is introduced, it only allocates the mTHP swap entries
>> >> >> > from new empty cluster list.  It has a fragmentation issue
>> >> >> > reported by Barry.
>> >> >> >
>> >> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> >> >> >
>> >> >> > The reason is that all the empty clusters have been exhausted while
>> >> >> > there are plenty of free swap entries in the clusters that are
>> >> >> > not 100% free.
>> >> >> >
>> >> >> > Remember the swap allocation order in the cluster.
>> >> >> > Keep track of the per order non full cluster list for later allocation.
>> >> >> >
>> >> >> > User impact: For users that allocate and free mixed-order mTHP swapping,
>> >> >> > it greatly improves the success rate of the mTHP swap allocation after the
>> >> >> > initial phase.
>> >> >> > >> >> >> > Barry provides a test program to show the effect: >> >> >> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/ >> >> >> > >> >> >> > Without: >> >> >> > $ mthp-swapout >> >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54% >> >> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00% >> >> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% >> >> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00% >> >> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% >> >> >> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00% >> >> >> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% >> >> >> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% >> >> >> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00% >> >> >> > >> >> >> > $ mthp-swapout -s >> >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65% >> >> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 
229, Fallback percentage: 100.00% >> >> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00% >> >> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00% >> >> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00% >> >> >> > >> >> >> > With: >> >> >> > $ mthp-swapout >> >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > ... >> >> >> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > >> >> >> > $ mthp-swapout -s >> >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00% >> 
>> >> > Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > ... >> >> >> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00% >> >> >> >> >> >> Unfortunately, the data is gotten using a special designed test program >> >> >> which always swap-in pages with swapped-out size. I don't know whether >> >> >> such workloads exist in reality. Otherwise, you need to wait for mTHP >> >> > >> >> > The test program is designed to simulate mTHP swap behavior using >> >> > zsmalloc and 64KB buffer. >> >> > If we insist on only designing for existing workloads, then zsmalloc >> >> > using 64KB buffer usage will never be able to run, exactly due the >> >> > kernel has high failure rate allocating swap entries for 64KB. There >> >> > is a bit of a chick and egg problem there, such a usage can not exist >> >> > because the kernel can't support it yet. Kernel can't add patches to >> >> > support it because such simulation tests are not "real". >> >> > >> >> > We need to break this cycle to support something new. 
>> >> > >> >> >> swap-in to be merged firstly, and people reach consensus that we should >> >> >> always swap-in pages with swapped-out size. >> >> > >> >> > We don't have to be always. We can identify the situation that makes >> >> > sense. For the zram/zsmalloc 64K buffer usage case, swap out as the >> >> > same swap in size makes sense. >> >> > I think we have agreement on such zsmalloc 64K usage cases we do want >> >> > to support. >> >> > >> >> >> >> >> >> Alternately, we can make some design adjustment to make the patchset >> >> >> work in current situation (mTHP swap-out, normal page swap-in). >> >> >> >> >> >> - One non-full cluster list for each order (same as current design) >> >> >> >> >> >> - When one swap entry is freed, check whether one "order+1" swap entry >> >> >> becomes free, if so, move the cluster to "order+1" non-full cluster >> >> >> list. >> >> > >> >> > In the intended zsmalloc usage case, there is no order+1 swap entry >> >> > request. >> >> >> >> This my main concern about this series. Only the Android use cases are >> >> considered. The general use cases are just ignored. Is it hard to >> >> consider or test a normal swap partition on your development machine? >> > >> > Please see the V4 cover letter. The V4 already has the SSD, zram and >> > HDD stress testing. >> > Of course I want to make sure the allocator works well with Barry's >> > mthp test case as well. >> > >> >> > Moving the cluster to "order+1" will make less cluster available for "order". >> >> > For that usage case it is negative gain. >> >> >> >> The "order+1" cluster can be used to allocate "order" cluster when >> >> existing "order" cluster is used up. >> >> >> >> And in this way, we can protect clusters with more free spaces so that >> >> they may become free. >> >> >> >> >> - When allocate swap entry with "order", get cluster from free, "order", >> >> >> "order+1", ... non-full cluster list. 
If all are empty, fallback to >> >> > >> >> > I don't see that it is useful for the zsmalloc 64K buffer usage case. >> >> > There will be order 0 and order 4 and nothing else. >> >> > >> >> > How about let's keep it simple for now. If we identify some workload >> >> > this algorithm can help. We can do that as a follow up step. >> >> >> >> The simple design isn't flexible enough for your workloads too. For >> >> example, >> >> >> >> - Initially, almost only order-0 pages are swapped out, most non-full >> >> clusters are order-0. >> >> >> >> - Later, quite some order-0 swap entries are freed so that there are >> >> quite some order-4 swap entries available. >> >> >> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full >> >> clusters available. >> >> >> >> So, we need a way to migrate non-full clusters among orders to adjust to >> >> the situations automatically. >> > >> > Depends on how lucky it is to form the order-4 cluster naturally. The >> > odds of forming the order-4 cluster naturally in random swap >> > allocation/ free case is very low. I have the number in my other email >> > thread. >> > Anyway, if we convince this payout for the complexity it introduces, >> > we can do that as follow up steps. Try to keep things simple at first >> > for the review benefit. >> > >> >> >> >> >> order 0. >> >> >> >> >> >> Do you think that this works? >> >> >> >> >> >> > Reported-by: Barry Song <21cnbao@gmail.com> >> >> >> > Signed-off-by: Chris Li <chrisl@kernel.org> >> >> >> > --- >> >> >> > Changes in v3: >> >> >> > - Using V1 as base. >> >> >> > - Rename "next" to "list" for the list field, suggested by Ying. >> >> >> > - Update comment for the locking rules for cluster fields and list, >> >> >> > suggested by Ying. >> >> >> > - Allocate from the nonfull list before attempting free list, suggested >> >> >> > by Kairui. >> >> >> >> >> >> Haven't looked into this. 
It appears that this breaks the original
>> >> >> discard behavior which helps performance of some SSD, please refer to
>> >> >
>> >> > Can you clarify by "discard" you mean SSD discard command or just the
>> >> > way swap allocator recycles free clusters?
>> >>
>> >> The SSD discard command, like in the following URL,
>> >>
>> >> https://en.wikipedia.org/wiki/Trim_(computing)
>> >
>> > Thanks. I know what an SSD discard command is. I want to understand why
>> > that behavior is preferred.
>> >
>> > So the reasoning to prefer a new free block rather than a recent
>> > partially free cluster is to give the previously written cluster a
>> > higher chance to issue the discard command?
>> >
>> > This prefer-new-block behavior is actually not friendly to SSDs from
>> > a wear point of view.
>> > Take this example:
>> > Let's say data is allocated and freed from swap. At any given
>> > time the swap usage is 1G. The swap SSD drive is 16G.
>> > Let's say the allocations and frees are at random 4K page locations.
>> > In total, 64G of swap data needs to be written to swap, but at any
>> > given time only 1G of data is occupied on the swapfile.
>> >
>> > a) If you always prefer new free blocks, then the swap data will
>> > eventually be written across the whole 16G drive, followed by random
>> > writes to the full 16G.
>> > The chance of forming a free cluster so a discard command can be
>> > issued is very low: (15/16)**512 = 4.4E-15. From the SSD's point of
>> > view, it does not know that most of the data written to the 16G drive
>> > is not used. When a page is freed on a swapfile, the SSD drive doesn't
>> > know about it. It sees 4K random writes to all 16G of the drive,
>> > total 64G data written.
>> >
>> > b) If you always prefer a non-full cluster over a new cluster,
>> > the 64G of data will concentrate as random writes to the first 1G of
>> > the drive. Total 64G data written.
>> >
>> > I consider b) more friendly to SSD than a), because it concentrates
>> > the writes into the first 1G location. The SSD can know that the data
>> > overwritten in that 1G is internally obsolete, so it can internally
>> > GC the overwritten data without a discard command. Whereas a) does
>> > random 4K writes to the whole drive without much discard at all. A full
>> > SSD doing random writes is a bad combination from a wear point of
>> > view.
>> >
>> > Just my 2 cents. Anyway, I revert V4 to use free clusters before
>> > nonfull clusters, just to behave the same as previously.
>> >
>> >> >> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD").
>> >> >
>> >> > I did read that change log. Help me understand in more detail which
>> >> > discard behavior you have in mind. A lot of low end micro SD cards
>> >> > have proper FTL wear leveling now, SSDs are even better at that.
>> >>
>> >> It's not FTL, it's discard/trim for SSD as above.
>> >
>> > Thanks for the clarification.
>> >
>> >>
>> >> >> And as pointed out by Ryan, this may reduce the opportunity of the
>> >> >> sequential block device writing during swap-out, which may hurt
>> >> >> performance of SSD too.
>> >> >
>> >> > Only at the initial phase. If the swap IO continues, after the first
>> >> > pass fills up the swap file, the writes will be random on the swapfile
>> >> > anyway, because the swapfile only issues 2M discard commands when all
>> >> > 512 4K pages are free. The discarded area will be much smaller than
>> >> > the free area on the swapfile. That, combined with random page writes
>> >> > across the whole swap file, might produce worse internal write
>> >> > amplification for the SSD, compared to only writing a subset of the
>> >> > swapfile area. I would love to hear from someone who understands SSD
>> >> > internals to confirm or deny my theory.
>> >>
>> >> It depends on workloads. Some workloads will have more severe
>> >> fragmentation than others. For example, on quite some machines, the
>> >> swap devices will be far from being full to avoid possible OOM.
>> >
>> > I suspect most of the SSD swap on client devices nowadays is only there
>> > as a backup, just in case something needs to be swapped.
>> > There is not much SSD swap IO during normal use. The zram and zswap
>> > are more actively used in the data center and Android phone case, from
>> > a swap IO ops point of view.
>>
>> I use a Linux laptop with 16GB DRAM for work. And I found that the swap
>> space is almost always used.
>
> Just curious how many swap OPS per second on average? I suspect it
> will be a very low number.

It depends on workloads. I have run some LLM pruning algorithm
experiments on the machine. The swap IOPS is high for that.

[snip]

--
Best Regards,
Huang, Ying
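The probability quoted in case a) of the example above can be checked with a few lines of arithmetic. This is only a sketch of that estimate: it assumes, as the quoted calculation implicitly does, that each 4K page is in use independently with probability 1/16 (1G used on a 16G device) and that a cluster is 512 pages (2M). The variable names are made up for illustration.

```python
# Sketch of the estimate from the thread: how likely is it that a whole
# 2M cluster (512 x 4K pages) is free when 1/16 of the device is in use?
# Assumption: pages are in use independently of each other.

PAGES_PER_CLUSTER = 512      # 2M cluster / 4K pages
USED_FRACTION = 1 / 16       # 1G of swap in use on a 16G device

# Probability that every page of one given cluster is free.
p_cluster_free = (1 - USED_FRACTION) ** PAGES_PER_CLUSTER

# Number of 2M clusters on a 16G device, and the expected number of them
# that are completely free (and so eligible for a discard).
total_clusters = (16 * 1024**3) // (2 * 1024**2)
expected_free_clusters = p_cluster_free * total_clusters

print(f"P(cluster fully free)   = {p_cluster_free:.2e}")
print(f"clusters on the device  = {total_clusters}")
print(f"expected free clusters  = {expected_free_clusters:.2e}")
```

The computed probability is on the order of 1E-15, in line with the 4.4E-15 figure in the quoted text, so under these assumptions essentially no cluster becomes fully free once writes are spread randomly over the whole device.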
This is the short term solution "swap cluster order" listed
in my "Swap Abstraction" discussion slide 8 in the recent
LSF/MM conference.

When commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" was introduced, it only allocated the mTHP swap entries
from the new empty cluster list. It has a fragmentation issue
reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The reason is that all the empty clusters have been exhausted while
there are plenty of free swap entries in the clusters that are
not 100% free.

Remember the swap allocation order in the cluster.
Keep track of the per order non full cluster list for later allocation.

User impact: For users that allocate and free mixed-order mTHP swap
entries, it greatly improves the success rate of the mTHP swap
allocation after the initial phase.

Barry provides a test program to show the effect:
https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/

Without:
$ mthp-swapout
Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54%
Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00%
Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00%

$ mthp-swapout -s
Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65%
Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00%
Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%

With:
$ mthp-swapout
Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
...
Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%

$ mthp-swapout -s
Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
...
Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%

Reported-by: Barry Song <21cnbao@gmail.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Changes in v3:
- Using V1 as base.
- Rename "next" to "list" for the list field, suggested by Ying.
- Update comment for the locking rules for cluster fields and list,
  suggested by Ying.
- Allocate from the nonfull list before attempting free list, suggested
  by Kairui.
- Link to v2: https://lore.kernel.org/r/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org

Changes in v2:
- Abandoned.
- Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org

---
Chris Li (2):
      mm: swap: swap cluster switch to double link list
      mm: swap: mTHP allocate swap entries from nonfull list

 include/linux/swap.h |  30 +++----
 mm/swapfile.c        | 248 +++++++++++++++++----------------------------------
 2 files changed, 95 insertions(+), 183 deletions(-)
---
base-commit: 19b8422c5bd56fb5e7085995801c6543a98bda1f
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
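The idea described in the cover letter (remember the allocation order of each cluster, keep a per-order list of non-full clusters, and try that list before the free list) can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the kernel implementation: the names `ToySwap`, `Cluster`, `alloc` and `free_entry` are hypothetical and do not correspond to anything in mm/swapfile.c, locking and discard are ignored, and a cluster is modelled as a set of used slots where one slot is one entry of the cluster's order.

```python
from collections import deque

CLUSTER_PAGES = 512  # 2M cluster of 4K pages, as discussed in the thread


class Cluster:
    """Toy cluster: remembers the order it serves and which slots are used."""

    def __init__(self, index):
        self.index = index
        self.order = None   # set when the cluster leaves the free list
        self.used = set()   # used slot numbers; one slot = 2**order pages


class ToySwap:
    """Hypothetical model of per-order nonfull lists plus one free list."""

    def __init__(self, nr_clusters, max_order=9):
        self.free = deque(Cluster(i) for i in range(nr_clusters))
        self.nonfull = {o: deque() for o in range(max_order + 1)}

    def alloc(self, order):
        """Return (cluster, slot), or None if the caller must fall back."""
        slots = CLUSTER_PAGES >> order
        if self.nonfull[order]:
            # Prefer a partially used cluster of the same order.
            c = self.nonfull[order][0]
        elif self.free:
            # Otherwise take an empty cluster and remember its order.
            c = self.free.popleft()
            c.order = order
            self.nonfull[order].append(c)
        else:
            return None
        slot = next(s for s in range(slots) if s not in c.used)
        c.used.add(slot)
        if len(c.used) == slots:
            self.nonfull[order].remove(c)   # cluster is now full
        return c, slot

    def free_entry(self, c, slot):
        """Release one entry and move the cluster to the right list."""
        slots = CLUSTER_PAGES >> c.order
        was_full = len(c.used) == slots
        c.used.remove(slot)
        if not c.used:
            if not was_full:
                self.nonfull[c.order].remove(c)
            c.order = None
            self.free.append(c)             # completely empty again
        elif was_full:
            self.nonfull[c.order].append(c)  # has room again


def demo():
    """Fill two clusters with order-4 entries, free every other entry,
    then count how many order-4 allocations still succeed."""
    sw = ToySwap(nr_clusters=2)
    entries = [sw.alloc(4) for _ in range(64)]
    for c, slot in entries[::2]:
        sw.free_entry(c, slot)
    # No cluster is completely free here, yet the nonfull lists have room.
    return sum(sw.alloc(4) is not None for _ in range(40))
```

In this model, `demo()` frees 32 of the 64 order-4 entries without ever emptying a cluster, so an allocator that only looks at empty clusters would fall back on every later request, while the nonfull lists can still serve the freed slots. A cluster only serves the order it remembered, which is a simplification chosen for the sketch.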