
Swap Min Order

Message ID 20250107094347.l37isnk3w2nmpx2i@AALNPWDAGOMEZ1.aal.scsc.local (mailing list archive)
State New
Series Swap Min Order

Commit Message

Daniel Gomez Jan. 7, 2025, 9:43 a.m. UTC
Hi,

High-capacity SSDs require writes to be aligned with the drive's
indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
support swap on these devices, we need to ensure that writes do not
cross IU boundaries. So, I think this may require increasing the minimum
allocation size for swap users.
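
To make the constraint concrete, here is a minimal userspace C sketch
(the 16 KiB IU is only an assumed example value, and the helper is made
up for illustration) of the rule a swap write would have to satisfy:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Example indirection unit; real drives may use 16 KiB, 32 KiB, ... */
#define IU_SIZE (16u * 1024u)

/*
 * A write avoids RMW on such a drive when it starts on an IU boundary
 * and covers a whole number of IUs.
 */
static bool write_fits_iu(uint64_t offset, uint64_t len)
{
    return (offset % IU_SIZE == 0) && (len % IU_SIZE == 0);
}

int main(void)
{
    /* A lone 4 KiB write forces the drive to RMW the surrounding 16 KiB IU. */
    printf("4 KiB @ 4 KiB   -> %s\n", write_fits_iu(4096, 4096) ? "ok" : "RMW");
    printf("16 KiB @ 16 KiB -> %s\n", write_fits_iu(16384, 16384) ? "ok" : "RMW");
    return 0;
}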

As a temporary alternative, a proposal [1] to prevent swap on these
devices was previously sent for discussion before LBS was merged
in v6.12 [2]. Additional details and reasoning can be found in the
discussion in [1].

[1] https://lore.kernel.org/all/20240627000924.2074949-1-mcgrof@kernel.org/
[2] https://lore.kernel.org/all/20240913-vfs-blocksize-ab40822b2366@brauner/

So, I’d like to bring this up for discussion here and/or propose it as
a topic for the next MM bi-weekly meeting if needed. Please let me know
if this has already been discussed previously. Given that we already
support large folios with mTHP in anon memory and shmem, a similar
approach where we avoid falling back to smaller allocations might
suffice, as it is done in the page cache with min order.
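
As a back-of-the-envelope sketch of what that could mean, assuming
4 KiB base pages and a hypothetical 16 KiB IU, the minimum folio order
for swap would be derived like this (illustrative userspace C, not
kernel code):

#include <stdio.h>

#define PAGE_SIZE_BYTES 4096u
#define IU_SIZE_BYTES   (16u * 1024u)  /* assumed IU, for illustration only */

/* log2 for power-of-two values */
static unsigned int ilog2u(unsigned long v)
{
    unsigned int r = 0;
    while (v > 1) {
        v >>= 1;
        r++;
    }
    return r;
}

int main(void)
{
    /*
     * Analogous to the page cache min order: the smallest folio order
     * whose size covers a whole IU, so swap writes never straddle one.
     */
    unsigned int min_order = ilog2u(IU_SIZE_BYTES / PAGE_SIZE_BYTES);

    printf("swap min order = %u (%lu KiB folios)\n", min_order,
           ((unsigned long)PAGE_SIZE_BYTES << min_order) / 1024);
    return 0;
}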

Monitoring writes on a dedicated NVMe device with swap enabled, using
the blkalgn tool [3], I get the following results:

[3] https://github.com/iovisor/bcc/pull/5128

Swap setup:
mkdir -p /mnt/swap
sudo mkfs.xfs -b size=16k /dev/nvme0n1 -f
sudo mount --types xfs /dev/nvme0n1 /mnt/swap

sudo fallocate -l 8192M /mnt/swap/swapfile
sudo chmod 600 /mnt/swap/swapfile
sudo mkswap /mnt/swap/swapfile
sudo swapon /mnt/swap/swapfile

Swap stress test (guest with 7.8Gi of RAM):
stress --vm-bytes 7859M --vm-keep -m 1 --timeout 300

Results:
1. Vanilla v6.12 no mTHP enabled

I/O Alignment Histogram for Device nvme0n1
     bytes               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 3255     |****************************************|
      8192 -> 16383      : 783      |*********                               |
     16384 -> 32767      : 255      |***                                     |
     32768 -> 65535      : 61       |                                        |
     65536 -> 131071     : 24       |                                        |
    131072 -> 262143     : 22       |                                        |
    262144 -> 524287     : 2136     |**************************              |

The above represents the alignment of writes, in power-of-2 steps, for
the swap-dedicated nvme0n1 device. The corresponding granularity for
these alignments is shown in the linear histogram below, where the
sector size is 512 bytes (e.g. a sector value of 8 means 8 << 9 = 4096
bytes). So the first count indicates that 821 writes were sent with a
size of 4 KiB, and the last one shows that 2441 writes were sent with a
size of 512 KiB.
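
As a quick reference for the sector column, a tiny C sketch of the
conversion (512-byte sectors, i.e. bytes = sectors << 9):

#include <stdio.h>

int main(void)
{
    /* blkalgn reports granularity in 512-byte sectors: bytes = sectors << 9 */
    unsigned int sectors[] = { 8, 32, 1024 };

    for (unsigned int i = 0; i < sizeof(sectors) / sizeof(sectors[0]); i++)
        printf("%4u sectors = %4u KiB\n", sectors[i], (sectors[i] << 9) / 1024);
    /* 8 sectors = 4 KiB, 32 sectors = 16 KiB, 1024 sectors = 512 KiB */
    return 0;
}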

I/O Granularity Histogram for Device nvme0n1
Total I/Os: 6536
     sector        : count     distribution
        8          : 821      |*************                           |
        16         : 131      |**                                      |
        24         : 339      |*****                                   |
        32         : 259      |****                                    |
        40         : 114      |*                                       |
        48         : 162      |**                                      |
        56         : 249      |****                                    |
        64         : 257      |****                                    |
        72         : 157      |**                                      |
        80         : 90       |*                                       |
        88         : 109      |*                                       |
        96         : 188      |***                                     |
        104        : 228      |***                                     |
        112        : 262      |****                                    |
        120        : 81       |*                                       |
        128        : 44       |                                        |
        136        : 22       |                                        |
        144        : 20       |                                        |
        152        : 20       |                                        |
        160        : 18       |                                        |
        168        : 43       |                                        |
        176        : 9        |                                        |
        184        : 5        |                                        |
        192        : 2        |                                        |
        200        : 3        |                                        |
        208        : 2        |                                        |
        216        : 4        |                                        |
        224        : 6        |                                        |
        232        : 4        |                                        |
        240        : 2        |                                        |
        248        : 11       |                                        |
        256        : 9        |                                        |
        264        : 17       |                                        |
        272        : 19       |                                        |
        280        : 16       |                                        |
        288        : 7        |                                        |
        296        : 5        |                                        |
        304        : 2        |                                        |
        312        : 7        |                                        |
        320        : 5        |                                        |
        328        : 4        |                                        |
        336        : 23       |                                        |
        344        : 2        |                                        |
        352        : 12       |                                        |
        360        : 5        |                                        |
        368        : 5        |                                        |
        376        : 1        |                                        |
        384        : 3        |                                        |
        392        : 3        |                                        |
        400        : 2        |                                        |
        408        : 1        |                                        |
        416        : 1        |                                        |
        424        : 6        |                                        |
        432        : 5        |                                        |
        440        : 3        |                                        |
        448        : 7        |                                        |
        456        : 2        |                                        |
        472        : 2        |                                        |
        480        : 2        |                                        |
        488        : 7        |                                        |
        496        : 5        |                                        |
        504        : 11       |                                        |
        520        : 3        |                                        |
        528        : 1        |                                        |
        536        : 2        |                                        |
        544        : 5        |                                        |
        560        : 1        |                                        |
        568        : 2        |                                        |
        576        : 1        |                                        |
        584        : 2        |                                        |
        592        : 2        |                                        |
        600        : 2        |                                        |
        608        : 1        |                                        |
        616        : 2        |                                        |
        624        : 5        |                                        |
        632        : 1        |                                        |
        640        : 1        |                                        |
        648        : 1        |                                        |
        656        : 5        |                                        |
        664        : 8        |                                        |
        672        : 20       |                                        |
        680        : 3        |                                        |
        688        : 1        |                                        |
        704        : 1        |                                        |
        712        : 1        |                                        |
        720        : 3        |                                        |
        728        : 4        |                                        |
        736        : 6        |                                        |
        744        : 14       |                                        |
        752        : 14       |                                        |
        760        : 12       |                                        |
        768        : 3        |                                        |
        776        : 5        |                                        |
        784        : 2        |                                        |
        792        : 2        |                                        |
        800        : 1        |                                        |
        808        : 3        |                                        |
        816        : 1        |                                        |
        824        : 5        |                                        |
        832        : 2        |                                        |
        840        : 15       |                                        |
        848        : 9        |                                        |
        856        : 2        |                                        |
        864        : 1        |                                        |
        872        : 2        |                                        |
        880        : 10       |                                        |
        888        : 4        |                                        |
        896        : 5        |                                        |
        904        : 1        |                                        |
        920        : 2        |                                        |
        936        : 3        |                                        |
        944        : 1        |                                        |
        952        : 6        |                                        |
        960        : 1        |                                        |
        968        : 1        |                                        |
        976        : 1        |                                        |
        984        : 1        |                                        |
        992        : 2        |                                        |
        1000       : 2        |                                        |
        1008       : 16       |                                        |
        1016       : 1        |                                        |
        1024       : 2441     |****************************************|

2. Vanilla v6.12 with all mTHP enabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 5076     |****************************************|
      8192 -> 16383      : 907      |*******                                 |
     16384 -> 32767      : 302      |**                                      |
     32768 -> 65535      : 141      |*                                       |
     65536 -> 131071     : 46       |                                        |
    131072 -> 262143     : 35       |                                        |
    262144 -> 524287     : 1993     |***************                         |
    524288 -> 1048575    : 6        |                                        |

In addition, I've tested and monitored writes with SWP_BLKDEV enabled
for regular files (see the patch at the end), to allow large folios for
swap files on block devices, and checked the difference.

The alignment results are as follows:

3. v6.12 + SWP_BLKDEV change with mTHP disabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 146      |*****                                   |
      8192 -> 16383      : 23       |                                        |
     16384 -> 32767      : 10       |                                        |
     32768 -> 65535      : 1        |                                        |
     65536 -> 131071     : 3        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 1020     |****************************************|

4. v6.12 + SWP_BLKDEV change with mTHP enabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 240      |******                                  |
      8192 -> 16383      : 34       |                                        |
     16384 -> 32767      : 4        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 1        |                                        |
    262144 -> 524287     : 1542     |****************************************|

2nd run:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 356      |************                            |
      8192 -> 16383      : 74       |**                                      |
     16384 -> 32767      : 58       |**                                      |
     32768 -> 65535      : 54       |*                                       |
     65536 -> 131071     : 37       |*                                       |
    131072 -> 262143     : 11       |                                        |
    262144 -> 524287     : 1104     |****************************************|
    524288 -> 1048575    : 1        |                                        |

For comparison, the histogram below represents a stress test with
random-size writes on a drive with LBS enabled (XFS with a 16k block
size):

I/O Alignment Histogram for Device nvme0n1
     Bytes               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 1758     |*                                       |
      1024 -> 2047       : 476      |                                        |
      2048 -> 4095       : 164      |                                        |
      4096 -> 8191       : 42       |                                        |
      8192 -> 16383      : 10       |                                        |
     16384 -> 32767      : 3629     |***                                     |
     32768 -> 65535      : 47861    |****************************************|
     65536 -> 131071     : 25702    |*********************                   |
    131072 -> 262143     : 10791    |*********                               |
    262144 -> 524287     : 11094    |*********                               |
    524288 -> 1048575    : 55       |                                        |

The test drive here uses a 512-byte LBA format, so writes can start at
that boundary. However, LBS/min order makes most of the writes fall on
16k or larger boundaries.

What do you think?

Daniel

Comments

David Hildenbrand Jan. 7, 2025, 10:31 a.m. UTC | #1
On 07.01.25 10:43, Daniel Gomez wrote:
> Hi,

Hi,

> 
> High-capacity SSDs require writes to be aligned with the drive's
> indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> support swap on these devices, we need to ensure that writes do not
> cross IU boundaries. So, I think this may require increasing the minimum
> allocation size for swap users.

How would we handle swapout/swapin when we have smaller pages (just 
imagine someone does a mmap(4KiB))?

Could this be something that gets abstracted/handled by the swap 
implementation? (i.e., multiple small folios get added to the swapcache 
but get written out / read in as a single unit?).

I recall that we have been talking about a better swap abstraction for 
years :)

Might be a good topic for LSF/MM (might or might not be a better place 
than the MM alignment session).
Daniel Gomez Jan. 7, 2025, 12:29 p.m. UTC | #2
On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> On 07.01.25 10:43, Daniel Gomez wrote:
> > Hi,
> 
> Hi,
> 
> > 
> > High-capacity SSDs require writes to be aligned with the drive's
> > indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> > support swap on these devices, we need to ensure that writes do not
> > cross IU boundaries. So, I think this may require increasing the minimum
> > allocation size for swap users.
> 
> How would we handle swapout/swapin when we have smaller pages (just imagine
> someone does a mmap(4KiB))?

Swapout would need to be aligned to the IU. An mmap of 4 KiB would
have to perform an IU-sized write, e.g. 16 KiB or 32 KiB, to avoid any
potential RMW penalty. So, I think aligning the mmap allocation to the
IU would guarantee a write of the required granularity and alignment.
But let's also look at your suggestion below about the swapcache.

Swapin can still be performed at LBA-format granularity (e.g. 4 KiB)
without the same write penalty implications, with performance only
affected if I/Os do not conform to these boundaries. So, reading at IU
boundaries is preferred for optimal performance, but not a 'requirement'.

> 
> Could this be something that gets abstracted/handled by the swap
> implementation? (i.e., multiple small folios get added to the swapcache but
> get written out / read in as a single unit?).

Do you mean merging like in the block layer? I'm not entirely sure if
this could deterministically guarantee the I/O boundaries the same way
min order large folio allocations do in the page cache. But I guess it
is worth exploring as an optimization.

> 
> I recall that we have been talking about a better swap abstraction for years
> :)

Adding Chris Li to the cc list in case he has more input.

> 
> Might be a good topic for LSF/MM (might or might not be a better place than
> the MM alignment session).

Both options work for me. LSF/MM is in 12 weeks, so having an earlier
session would be great.

Daniel

> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand Jan. 7, 2025, 4:41 p.m. UTC | #3
On 07.01.25 13:29, Daniel Gomez wrote:
> On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
>> On 07.01.25 10:43, Daniel Gomez wrote:
>>> Hi,
>>
>> Hi,
>>
>>>
>>> High-capacity SSDs require writes to be aligned with the drive's
>>> indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
>>> support swap on these devices, we need to ensure that writes do not
>>> cross IU boundaries. So, I think this may require increasing the minimum
>>> allocation size for swap users.
>>
>> How would we handle swapout/swapin when we have smaller pages (just imagine
>> someone does a mmap(4KiB))?
> 
> Swapout would require to be aligned to the IU. An mmap of 4 KiB would
> have to perform an IU KiB write, e.g. 16 KiB or 32 KiB, to avoid any
> potential RMW penalty. So, I think aligning the mmap allocation to the
> IU would guarantee a write of the required granularity and alignment.

We must be prepared to handle any VMA layout with single-page VMAs, 
single-page holes etc ... :/ IMHO we should try to handle this 
transparently to the application.

> But let's also look at your suggestion below with swapcache.
> 
> Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> the same write penalty implications, and only affecting performance
> if I/Os are not conformant to these boundaries. So, reading at IU
> boundaries is preferred to get optimal performance, not a 'requirement'.
> 
>>
>> Could this be something that gets abstracted/handled by the swap
>> implementation? (i.e., multiple small folios get added to the swapcache but
>> get written out / read in as a single unit?).
> 
> Do you mean merging like in the block layer? I'm not entirely sure if
> this could guarantee deterministically the I/O boundaries the same way
> it does min order large folio allocations in the page cache. But I guess
> is worth exploring as optimization.

Maybe the swapcache could somehow abstract that? We currently have the 
swap slot allocator, that assigns slots to pages.

Assuming we have a 16 KiB BS but a 4 KiB page, we might have various 
options to explore.

For example, we could size swap slots 16 KiB, and assign even 4 KiB 
pages a single slot. This would waste swap space with small folios, that 
would go away with large folios.

If we stick to 4 KiB swap slots, maybe pageout() could be taught to 
effectively write back "everything" residing in the relevant swap slots 
that span a BS?

I recall there was a discussion about atomic writes involving multiple 
pages, and how it is hard. Maybe with swapping it is "easier"? Absolutely 
no expert on that, unfortunately. Hoping Chris has some ideas.


> 
>>
>> I recall that we have been talking about a better swap abstraction for years
>> :)
> 
> Adding Chris Li to the cc list in case he has more input.
> 
>>
>> Might be a good topic for LSF/MM (might or might not be a better place than
>> the MM alignment session).
> 
> Both options work for me. LSF/MM is in 12 weeks so, having a previous
> session would be great.

Both work for me.
Daniel Gomez Jan. 8, 2025, 2:14 p.m. UTC | #4
On Tue, Jan 07, 2025 at 05:41:23PM +0100, David Hildenbrand wrote:
> On 07.01.25 13:29, Daniel Gomez wrote:
> > On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> > > On 07.01.25 10:43, Daniel Gomez wrote:
> > > > Hi,
> > > 
> > > Hi,
> > > 
> > > > 
> > > > High-capacity SSDs require writes to be aligned with the drive's
> > > > indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> > > > support swap on these devices, we need to ensure that writes do not
> > > > cross IU boundaries. So, I think this may require increasing the minimum
> > > > allocation size for swap users.
> > > 
> > > How would we handle swapout/swapin when we have smaller pages (just imagine
> > > someone does a mmap(4KiB))?
> > 
> > Swapout would require to be aligned to the IU. An mmap of 4 KiB would
> > have to perform an IU KiB write, e.g. 16 KiB or 32 KiB, to avoid any
> > potential RMW penalty. So, I think aligning the mmap allocation to the
> > IU would guarantee a write of the required granularity and alignment.
> 
> We must be prepared to handle and VMA layout with single-page VMAs,
> single-page holes etc ... :/ IMHO we should try to handle this transparently
> to the application.
> 
> > But let's also look at your suggestion below with swapcache.
> > 
> > Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> > the same write penalty implications, and only affecting performance
> > if I/Os are not conformant to these boundaries. So, reading at IU
> > boundaries is preferred to get optimal performance, not a 'requirement'.
> > 
> > > 
> > > Could this be something that gets abstracted/handled by the swap
> > > implementation? (i.e., multiple small folios get added to the swapcache but
> > > get written out / read in as a single unit?).
> > 
> > Do you mean merging like in the block layer? I'm not entirely sure if
> > this could guarantee deterministically the I/O boundaries the same way
> > it does min order large folio allocations in the page cache. But I guess
> > is worth exploring as optimization.
> 
> Maybe the swapcache could somehow abstract that? We currently have the swap
> slot allocator, that assigns slots to pages.
> 
> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various options
> to explore.
> 
> For example, we could size swap slots 16 KiB, and assign even 4 KiB pages a
> single slot. This would waste swap space with small folios, that would go
> away with large folios.

So batching order-0 folios in bigger slots that match the FS BS (e.g. 16
KiB) to perform disk writes, right? Can we also assign different orders
to the same slot? And can we batch folios while keeping alignment to the
BS (IU)?

> 
> If we stick to 4 KiB swap slots, maybe pageout() could be taught to
> effectively writeback "everything" residing in the relevant swap slots that
> span a BS?
> 
> I recall there was a discussion about atomic writes involving multiple
> pages, and how it is hard. Maybe with swaping it is "easier"? Absolutely no
> expert on that, unfortunately. Hoping Chris has some ideas.

Not sure about the discussion, but I guess the main concern for atomic
writes and swapping is the alignment and the questions I raised above.

> 
> 
> > 
> > > 
> > > I recall that we have been talking about a better swap abstraction for years
> > > :)
> > 
> > Adding Chris Li to the cc list in case he has more input.
> > 
> > > 
> > > Might be a good topic for LSF/MM (might or might not be a better place than
> > > the MM alignment session).
> > 
> > Both options work for me. LSF/MM is in 12 weeks so, having a previous
> > session would be great.
> 
> Both work for me.

Can we start by scheduling this topic for the next available MM session?
It would be great to get initial feedback/thoughts/concerns, etc., while
we keep this thread going.

> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand Jan. 8, 2025, 8:36 p.m. UTC | #5
>> Maybe the swapcache could somehow abstract that? We currently have the swap
>> slot allocator, that assigns slots to pages.
>>
>> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various options
>> to explore.
>>
>> For example, we could size swap slots 16 KiB, and assign even 4 KiB pages a
>> single slot. This would waste swap space with small folios, that would go
>> away with large folios.
> 
> So batching order-0 folios in bigger slots that match the FS BS (e.g. 16
> KiB) to perform disk writes, right?

Batching might be one idea, but the first idea I raised here would be 
that the swap slot size matches the BS (e.g., 16 KiB) and contains at 
most one folio.

So an order-0 folio would get a single slot assigned and effectively 
"waste" 12 KiB of disk space.

An order-2 folio would get a single slot assigned and not waste any memory.

An order-3 folio would get two slots assigned etc. (similar to how it is 
done today for non-order-0 folios)

So the penalty for using small folios would be more wasted disk space on 
such devices.
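
Roughly, the arithmetic would look like this (a userspace sketch for
illustration only, assuming a 16 KiB BS and 4 KiB pages; not the actual
swap slot allocator):

#include <stdio.h>

#define PAGE_SIZE_BYTES 4096ul
#define BS_BYTES        (16ul * 1024ul)   /* assumed block size / IU */

int main(void)
{
    for (unsigned int order = 0; order <= 4; order++) {
        unsigned long folio = PAGE_SIZE_BYTES << order;
        /* One or more whole BS-sized slots per folio, rounded up. */
        unsigned long slots = (folio + BS_BYTES - 1) / BS_BYTES;
        unsigned long waste = slots * BS_BYTES - folio;

        printf("order-%u folio (%3lu KiB): %lu slot(s), %2lu KiB wasted\n",
               order, folio / 1024, slots, waste / 1024);
    }
    /* order-0: 1 slot, 12 KiB wasted; order-2: 1 slot, 0 wasted;
       order-3: 2 slots, 0 wasted -- matching the scheme described above. */
    return 0;
}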

> Can we also assign different orders to the same slot?

I guess yes.

> And can we batch folios while keeping alignment to the BS (IU)?

I assume with "batching" you would mean that we could actually have 
multiple folios inside a single BS, like up to 4 order-0 folios in a 
single 16 KiB block? That might be one way of doing it, although I 
suspect this can get a bit complicated.

IIUC, we can perform 4 KiB reads/writes, but we must only have a single 
write per block, because otherwise we might get RMW problems, 
correct? Then, maybe a mechanism to guarantee that only a single swap 
writeback within a BS can happen at a time might also be an 
alternative.
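
Something like the following toy sketch (plain C11 atomics, purely
illustrative; names and sizes are made up and this is not how the
kernel would implement it):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_BLOCKS 1024   /* swap area size in BS-sized blocks (made up) */

/* One flag per BS-sized block; set while a writeback targets that block. */
static atomic_flag block_busy[NR_BLOCKS];

/* Returns true if the caller won the right to write back this block. */
static bool block_writeback_trylock(unsigned int block)
{
    return !atomic_flag_test_and_set(&block_busy[block]);
}

static void block_writeback_unlock(unsigned int block)
{
    atomic_flag_clear(&block_busy[block]);
}

int main(void)
{
    unsigned int blk = 7;

    for (unsigned int i = 0; i < NR_BLOCKS; i++)
        atomic_flag_clear(&block_busy[i]);

    if (block_writeback_trylock(blk)) {
        /* ... submit one write covering the whole 16 KiB block ... */
        printf("block %u: writeback submitted\n", blk);
        block_writeback_unlock(blk);
    } else {
        /* A racing writer for the same block would defer or batch in. */
        printf("block %u: writeback already in flight\n", blk);
    }
    return 0;
}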

> 
>>
>> If we stick to 4 KiB swap slots, maybe pageout() could be taught to
>> effectively writeback "everything" residing in the relevant swap slots that
>> span a BS?
>>
>> I recall there was a discussion about atomic writes involving multiple
>> pages, and how it is hard. Maybe with swaping it is "easier"? Absolutely no
>> expert on that, unfortunately. Hoping Chris has some ideas.
> 
> Not sure about the discussion but I guess the main concern for atomic
> and swaping is the alignment and the questions I raised above.

Yes, I think that's similar.

> 
>>
>>
>>>
>>>>
>>>> I recall that we have been talking about a better swap abstraction for years
>>>> :)
>>>
>>> Adding Chris Li to the cc list in case he has more input.
>>>
>>>>
>>>> Might be a good topic for LSF/MM (might or might not be a better place than
>>>> the MM alignment session).
>>>
>>> Both options work for me. LSF/MM is in 12 weeks so, having a previous
>>> session would be great.
>>
>> Both work for me.
> 
> Can we start by scheduling this topic for the next available MM session?
> Would be great to get initial feedback/thoughts/concerns, etc while we
> keep this thread going on.

Yeah, it would probably be great to present the problem and the exact 
constraints we have (e.g., the things stupid me asks above regarding 
actual sizes in which we can perform reads and writes), so we can 
discuss possible solutions.

@David R., is the slot in two weeks already taken?
Chris Li Jan. 8, 2025, 9:05 p.m. UTC | #6
On Tue, Jan 7, 2025 at 4:29 AM Daniel Gomez <da.gomez@samsung.com> wrote:
>
> On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> > On 07.01.25 10:43, Daniel Gomez wrote:
> > > Hi,
> >
> > Hi,
> >
> > >
> > > High-capacity SSDs require writes to be aligned with the drive's
> > > indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> > > support swap on these devices, we need to ensure that writes do not
> > > cross IU boundaries. So, I think this may require increasing the minimum
> > > allocation size for swap users.
> >
> > How would we handle swapout/swapin when we have smaller pages (just imagine
> > someone does a mmap(4KiB))?
>
> Swapout would require to be aligned to the IU. An mmap of 4 KiB would
> have to perform an IU KiB write, e.g. 16 KiB or 32 KiB, to avoid any
> potential RMW penalty. So, I think aligning the mmap allocation to the
> IU would guarantee a write of the required granularity and alignment.
> But let's also look at your suggestion below with swapcache.

I think only the writer needs to be grouped by IU size. Ideally the
swap front end doesn't have to know about the IU size. There are many
reasons why forcing the swap entry size on the swap cache would be
tricky, e.g. if the folio is 4K, it is tricky to force it to be 16K:
only one 4K page may be cold while the nearby page is hot, etc.

>
> Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> the same write penalty implications, and only affecting performance
> if I/Os are not conformant to these boundaries. So, reading at IU
> boundaries is preferred to get optimal performance, not a 'requirement'.
>
> >
> > Could this be something that gets abstracted/handled by the swap
> > implementation? (i.e., multiple small folios get added to the swapcache but
> > get written out / read in as a single unit?).

Yes.

>
> Do you mean merging like in the block layer? I'm not entirely sure if
> this could guarantee deterministically the I/O boundaries the same way
> it does min order large folio allocations in the page cache. But I guess
> is worth exploring as optimization.
>
> >
> > I recall that we have been talking about a better swap abstraction for years
> > :)
>
> Adding Chris Li to the cc list in case he has more input.

Sorry, I'm a bit late to the party. Yes, I do have some ideas I want to
propose as LSF/MM topics, maybe early next week.
Here are some highlights.

I think we need a separation between the swap cache and the backing
IO of the swap file. I call it the "virtual swapfile".
It is virtual in two aspects:
1) There is an up-front size at swapon, but no up-front allocation of
the vmalloc array. The array grows as needed.
2) There is a virtual-to-physical swap entry mapping. The cost is 4
bytes per swap entry, but it will solve a lot of problems all together.
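
As a very rough userspace sketch of the mapping idea only (all names
and sizes below are made up; the real design is still being written up):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Illustration: a growable array mapping virtual swap slots, as seen
 * by the swap cache, to physical slots in the swap file -- roughly the
 * "4 bytes per swap entry" cost mentioned above.
 */
struct virtual_swapfile {
    uint32_t *phys;     /* phys[virt] = physical slot, UINT32_MAX = unmapped */
    size_t nr_slots;    /* grows as needed, no up-front allocation */
};

static int vswap_map(struct virtual_swapfile *vs, size_t virt, uint32_t phys)
{
    if (virt >= vs->nr_slots) {
        size_t new_nr = virt + 1;
        uint32_t *tmp = realloc(vs->phys, new_nr * sizeof(*tmp));

        if (!tmp)
            return -1;
        for (size_t i = vs->nr_slots; i < new_nr; i++)
            tmp[i] = UINT32_MAX;
        vs->phys = tmp;
        vs->nr_slots = new_nr;
    }
    vs->phys[virt] = phys;
    return 0;
}

int main(void)
{
    struct virtual_swapfile vs = { 0 };

    /* Two cold 4K entries, far apart virtually, remapped to adjacent
       physical slots so they can be written out as one IU-sized unit. */
    vswap_map(&vs, 42, 100);
    vswap_map(&vs, 99, 101);
    printf("virt 42 -> phys %u, virt 99 -> phys %u\n",
           (unsigned)vs.phys[42], (unsigned)vs.phys[99]);
    free(vs.phys);
    return 0;
}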

IU-size write grouping would be a good user of this virtual layer.
Another use case is if we want to write a compressed zswap/zram entry
into the SSD; there we might actually encounter the size problem in the
other direction, e.g. writing swap entries smaller than 4K.

I am still working on the write up. More details will come.

Chris

>
> >
> > Might be a good topic for LSF/MM (might or might not be a better place than
> > the MM alignment session).
>
> Both options work for me. LSF/MM is in 12 weeks so, having a previous
> session would be great.
>
> Daniel
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
Chris Li Jan. 8, 2025, 9:09 p.m. UTC | #7
On Tue, Jan 7, 2025 at 8:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.01.25 13:29, Daniel Gomez wrote:
> > On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> >> On 07.01.25 10:43, Daniel Gomez wrote:
> >>> Hi,
> >>
> >> Hi,
> >>
> >>>
> >>> High-capacity SSDs require writes to be aligned with the drive's
> >>> indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> >>> support swap on these devices, we need to ensure that writes do not
> >>> cross IU boundaries. So, I think this may require increasing the minimum
> >>> allocation size for swap users.
> >>
> >> How would we handle swapout/swapin when we have smaller pages (just imagine
> >> someone does a mmap(4KiB))?
> >
> > Swapout would require to be aligned to the IU. An mmap of 4 KiB would
> > have to perform an IU KiB write, e.g. 16 KiB or 32 KiB, to avoid any
> > potential RMW penalty. So, I think aligning the mmap allocation to the
> > IU would guarantee a write of the required granularity and alignment.
>
> We must be prepared to handle and VMA layout with single-page VMAs,
> single-page holes etc ... :/ IMHO we should try to handle this
> transparently to the application.
>
> > But let's also look at your suggestion below with swapcache.
> >
> > Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> > the same write penalty implications, and only affecting performance
> > if I/Os are not conformant to these boundaries. So, reading at IU
> > boundaries is preferred to get optimal performance, not a 'requirement'.
> >
> >>
> >> Could this be something that gets abstracted/handled by the swap
> >> implementation? (i.e., multiple small folios get added to the swapcache but
> >> get written out / read in as a single unit?).
> >
> > Do you mean merging like in the block layer? I'm not entirely sure if
> > this could guarantee deterministically the I/O boundaries the same way
> > it does min order large folio allocations in the page cache. But I guess
> > is worth exploring as optimization.
>
> Maybe the swapcache could somehow abstract that? We currently have the
> swap slot allocator, that assigns slots to pages.
>
> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various
> options to explore.
>
> For example, we could size swap slots 16 KiB, and assign even 4 KiB
> pages a single slot. This would waste swap space with small folios, that
> would go away with large folios.

We can group multiple 4K swap entries into one 16K write unit.
There will be no wasted space on the SSD.

>
> If we stick to 4 KiB swap slots, maybe pageout() could be taught to
> effectively writeback "everything" residing in the relevant swap slots
> that span a BS?
>
> I recall there was a discussion about atomic writes involving multiple
> pages, and how it is hard. Maybe with swaping it is "easier"? Absolutely
> no expert on that, unfortunately. Hoping Chris has some ideas.

Yes, see my other email about the "virtual swapfile" idea. More
detailed write up coming next week.

Chris

>
>
> >
> >>
> >> I recall that we have been talking about a better swap abstraction for years
> >> :)
> >
> > Adding Chris Li to the cc list in case he has more input.
> >
> >>
> >> Might be a good topic for LSF/MM (might or might not be a better place than
> >> the MM alignment session).
> >
> > Both options work for me. LSF/MM is in 12 weeks so, having a previous
> > session would be great.
>
> Both work for me.
>
> --
> Cheers,
>
> David / dhildenb
>
Chris Li Jan. 8, 2025, 9:19 p.m. UTC | #8
On Wed, Jan 8, 2025 at 12:36 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> Maybe the swapcache could somehow abstract that? We currently have the swap
> >> slot allocator, that assigns slots to pages.
> >>
> >> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various options
> >> to explore.
> >>
> >> For example, we could size swap slots 16 KiB, and assign even 4 KiB pages a
> >> single slot. This would waste swap space with small folios, that would go
> >> away with large folios.
> >
> > So batching order-0 folios in bigger slots that match the FS BS (e.g. 16
> > KiB) to perform disk writes, right?
>
> Batching might be one idea, but the first idea I raised here would be
> that the swap slot size will match the BS (e.g., 16 KiB) and contain at
> most one folio.
>
> So a order-0 folio would get a single slot assigned and effectively
> "waste" 12 KiB of disk space.

I prefer not to "waste" that. It will be wasted on the write
amplification as well.

>
> An order-2 folio would get a single slot assigned and not waste any memory.
>
> An order-3 folio would get two slots assigned etc. (similar to how it is
> done today for non-order-0 folios)
>
> So the penalty for using small folios would be more wasted disk space on
> such devices.
>
> Can we also assign different orders
> > to the same slot?
>
> I guess yes.
>
> And can we batch folios while keeping alignment to the
> > BS (IU)?
>
> I assume with "batching" you would mean that we could actually have
> multiple folios inside a single BS, like up to 4 order-0 folios in a
> single 16 KiB block? That might be one way of doing it, although I
> suspect this can get a bit complicated.

That would be my preference. BTW, another use case is writing
compressed swap entries into the SSD (to reduce wear on the SSD), where
we will also end up in a similar situation of wanting to combine
multiple swap entries into a write unit.

>
> IIUC, we can perform 4 KiB read/write, but we must only have a single
> write per block, because otherwise we might get the RMW problems,
> correct? Then, maybe a mechanism to guarantee that only a single swap
> writeback within a BS can happen at one point in time might also be an
> alternative.

Yes, I do see that batching and grouping writes of the swap entries is
necessary and useful.

>
> >
> >>
> >> If we stick to 4 KiB swap slots, maybe pageout() could be taught to
> >> effectively writeback "everything" residing in the relevant swap slots that
> >> span a BS?
> >>
> >> I recall there was a discussion about atomic writes involving multiple
> >> pages, and how it is hard. Maybe with swaping it is "easier"? Absolutely no
> >> expert on that, unfortunately. Hoping Chris has some ideas.
> >
> > Not sure about the discussion but I guess the main concern for atomic
> > and swaping is the alignment and the questions I raised above.
>
> Yes, I think that's similar.

Agree, it is very much similar. It can share a single solution, the
"virtual swapfile". That is my proposal.

>
> >
> >>
> >>
> >>>
> >>>>
> >>>> I recall that we have been talking about a better swap abstraction for years
> >>>> :)
> >>>
> >>> Adding Chris Li to the cc list in case he has more input.
> >>>
> >>>>
> >>>> Might be a good topic for LSF/MM (might or might not be a better place than
> >>>> the MM alignment session).
> >>>
> >>> Both options work for me. LSF/MM is in 12 weeks so, having a previous
> >>> session would be great.
> >>
> >> Both work for me.
> >
> > Can we start by scheduling this topic for the next available MM session?
> > Would be great to get initial feedback/thoughts/concerns, etc while we
> > keep this thread going on.
>
> Yeah, it would probably great to present the problem and the exact
> constraints we have (e.g., things stupid me asks above regarding actual
> sizes in which we can perform reads and writes), so we can discuss
> possible solutions.
>
> @David R., is the slot in two weeks already taken?

Hopefully I can send out the "virtual swapfile" proposal in time and
we can discuss that as one of the possible approaches.

Chris

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Jan. 8, 2025, 9:24 p.m. UTC | #9
On 08.01.25 22:19, Chris Li wrote:
> On Wed, Jan 8, 2025 at 12:36 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> Maybe the swapcache could somehow abstract that? We currently have the swap
>>>> slot allocator, that assigns slots to pages.
>>>>
>>>> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various options
>>>> to explore.
>>>>
>>>> For example, we could size swap slots 16 KiB, and assign even 4 KiB pages a
>>>> single slot. This would waste swap space with small folios, that would go
>>>> away with large folios.
>>>
>>> So batching order-0 folios in bigger slots that match the FS BS (e.g. 16
>>> KiB) to perform disk writes, right?
>>
>> Batching might be one idea, but the first idea I raised here would be
>> that the swap slot size will match the BS (e.g., 16 KiB) and contain at
>> most one folio.
>>
>> So a order-0 folio would get a single slot assigned and effectively
>> "waste" 12 KiB of disk space.
> 
> I prefer not to "waste" that. It will be wasted on the write
> amplification as well.

If it can be implemented fairly easily, sure! :)

Looking forward to hearing about the proposal!

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0a9071cfe1d..80a9dbe9645a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3128,6 +3128,7 @@  static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
                si->flags |= SWP_BLKDEV;
        } else if (S_ISREG(inode->i_mode)) {
                si->bdev = inode->i_sb->s_bdev;
+               si->flags |= SWP_BLKDEV;
        }

        return 0;