[v3,0/2] mm: store zero pages to be swapped out in a bitmap

Message ID 20240610121820.328876-1-usamaarif642@gmail.com

Message

Usama Arif June 10, 2024, 12:15 p.m. UTC
Going back to the v1 implementation of the patch series. The main reasons
are that a correct version of the v2 implementation requires another rmap
walk in shrink_folio_list to change the PTEs from swap entries to zero
pages (i.e. more CPU used) [1], is more complex to implement, and is
harder to verify for correctness compared to v1, where everything is
handled by swap.

---
As shown in the patch series that introduced the zswap same-filled
optimization [2], 10-20% of the pages stored in zswap are same-filled.
This is also observed across Meta's server fleet.
Using VM counters in swap_writepage (not included in this patch series),
it was found that less than 1% of the same-filled pages being swapped out
are non-zero.

For a conventional swap setup (without zswap), rather than reading/writing
these pages to flash, resulting in increased I/O and flash wear, a bitmap
can be used to mark these pages as zero at write time, and the pages can
be refilled with zeros at read time if the bit corresponding to the page
is set.
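
To make this concrete, below is a rough sketch of the idea (not the code in
the patches themselves); the helper names and the per-swapfile sis->zeromap
bitmap (one bit per swap slot) are mine:

#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/highmem.h>

static bool swap_zeromap_folio_test(struct folio *folio)
{
	struct swap_info_struct *sis = swp_swap_info(folio->swap);
	unsigned long offset = swp_offset(folio->swap);
	int i;

	/* The folio can only be refilled from the bitmap if every slot is set. */
	for (i = 0; i < folio_nr_pages(folio); i++) {
		if (!test_bit(offset + i, sis->zeromap))
			return false;
	}
	return true;
}

static void swap_zeromap_folio_set(struct folio *folio)
{
	struct swap_info_struct *sis = swp_swap_info(folio->swap);
	unsigned long offset = swp_offset(folio->swap);
	int i;

	/* Atomic set_bit() so concurrent swap writers stay safe. */
	for (i = 0; i < folio_nr_pages(folio); i++)
		set_bit(offset + i, sis->zeromap);
}

/* Read side: refill the folio with zeros instead of issuing IO. */
static void swap_read_folio_zeromap(struct folio *folio)
{
	int i;

	for (i = 0; i < folio_nr_pages(folio); i++)
		clear_highpage(folio_page(folio, i));
	folio_mark_uptodate(folio);
}

At write time a zero-filled folio would call swap_zeromap_folio_set() and
skip the bio entirely, while a non-zero folio would clear its bits and be
written out as usual; at read time, if swap_zeromap_folio_test() returns
true, the folio is refilled via swap_read_folio_zeromap() and no IO is
issued.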

When zswap is used on top of swap, this also means that a zswap_entry does
not need to be allocated for zero-filled pages, resulting in memory savings
that offset the memory used for the bitmap.
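
As a rough ballpark (my arithmetic, not a measurement from this series):
with 4 KiB pages the bitmap costs one bit per swap slot, i.e. about 32 KiB
per 1 GiB of swap, while each avoided zswap_entry is on the order of tens
of bytes, so even a small share of zero-filled pages resident in zswap
covers the bitmap's cost.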

A similar attempt was made earlier in [3], where zswap would only track
zero-filled pages instead of same-filled pages.
This patch series adds the zero-filled page optimization to swap
(hence it can be used even if zswap is disabled) and removes the
same-filled code from zswap (as less than 1% of the same-filled pages are
non-zero), simplifying the code.

This patch series is based on mm-unstable.


[1] https://lore.kernel.org/all/e4d167fe-cb1e-41d1-a144-00bfa14b7148@gmail.com/
[2] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[3] https://lore.kernel.org/lkml/20240325235018.2028408-1-yosryahmed@google.com/

---
v2 -> v3:
- Going back to the v1 version of the implementation (David and Shakeel)
- convert non-atomic bitmap_set/clear to atomic set_bit/clear_bit (Johannes)
- use clear_highpage instead of folio_page_zero_fill (Yosry)

v1 -> v2:
- instead of using a bitmap in swap, clear the PTE for zero pages and let
  do_pte_missing handle the page at page fault. (Yosry and Matthew)
- check the end of the page first when checking whether a folio is
  zero-filled, as it could lead to better performance; a sketch of such a
  check follows this list. (Yosry)
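
A rough illustration of that check (the helper name and details are mine,
not necessarily what the patch does), testing the last word of each page
before scanning the rest:

#include <linux/highmem.h>
#include <linux/mm.h>

static bool is_folio_zero_filled(struct folio *folio)
{
	unsigned long *data;
	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
	int i;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		data = kmap_local_folio(folio, i * PAGE_SIZE);
		/* Check the last word first; non-zero data often sits at the end. */
		if (data[last_pos]) {
			kunmap_local(data);
			return false;
		}
		for (pos = 0; pos < last_pos; pos++) {
			if (data[pos]) {
				kunmap_local(data);
				return false;
			}
		}
		kunmap_local(data);
	}
	return true;
}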
 
Usama Arif (2):
  mm: store zero pages to be swapped out in a bitmap
  mm: remove code to handle same filled pages

 include/linux/swap.h |  1 +
 mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
 mm/swapfile.c        | 21 +++++++++-
 mm/zswap.c           | 86 ++++-------------------------------------
 4 files changed, 119 insertions(+), 81 deletions(-)

Comments

Yosry Ahmed June 13, 2024, 9:21 p.m. UTC | #1
On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> Going back to the v1 implementation of the patchseries. The main reason
> is that a correct version of v2 implementation requires another rmap
> walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> work (i.e. more CPU used) [1], is more complex to implement compared to v1
> and is harder to verify correctness compared to v1, where everything is
> handled by swap.
>
> ---
> As shown in the patchseries that introduced the zswap same-filled
> optimization [2], 10-20% of the pages stored in zswap are same-filled.
> This is also observed across Meta's server fleet.
> By using VM counters in swap_writepage (not included in this
> patchseries) it was found that less than 1% of the same-filled
> pages to be swapped out are non-zero pages.
>
> For conventional swap setup (without zswap), rather than reading/writing
> these pages to flash resulting in increased I/O and flash wear, a bitmap
> can be used to mark these pages as zero at write time, and the pages can
> be filled at read time if the bit corresponding to the page is set.
>
> When using zswap with swap, this also means that a zswap_entry does not
> need to be allocated for zero filled pages resulting in memory savings
> which would offset the memory used for the bitmap.
>
> A similar attempt was made earlier in [3] where zswap would only track
> zero-filled pages instead of same-filled.
> This patchseries adds zero-filled pages optimization to swap
> (hence it can be used even if zswap is disabled) and removes the
> same-filled code from zswap (as only 1% of the same-filled pages are
> non-zero), simplifying code.
>
> This patchseries is based on mm-unstable.

Aside from saving swap/zswap space and simplifying the zswap code
(thanks for that!), did you observe any performance benefits from not
having to go into the zswap code for zero-filled pages?

In [3], I observed a ~1.5% improvement in kernbench just by optimizing
zswap's handling of zero-filled pages, and that benchmark only produced
around 1.5% zero-filled pages. I imagine that by avoiding the zswap code
entirely, and for workloads that have 10-20% zero-filled pages, the
performance improvement should be more pronounced.

When zswap is not being used and all swap activity translates to IO, I
imagine the benefits will be much more significant.

I am curious if you have any numbers with or without zswap :)
Yosry Ahmed June 13, 2024, 9:50 p.m. UTC | #2
On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> Going back to the v1 implementation of the patchseries. The main reason
> is that a correct version of v2 implementation requires another rmap
> walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> work (i.e. more CPU used) [1], is more complex to implement compared to v1
> and is harder to verify correctness compared to v1, where everything is
> handled by swap.
>
> ---
> As shown in the patchseries that introduced the zswap same-filled
> optimization [2], 10-20% of the pages stored in zswap are same-filled.
> This is also observed across Meta's server fleet.
> By using VM counters in swap_writepage (not included in this
> patchseries) it was found that less than 1% of the same-filled
> pages to be swapped out are non-zero pages.
>
> For conventional swap setup (without zswap), rather than reading/writing
> these pages to flash resulting in increased I/O and flash wear, a bitmap
> can be used to mark these pages as zero at write time, and the pages can
> be filled at read time if the bit corresponding to the page is set.
>
> When using zswap with swap, this also means that a zswap_entry does not
> need to be allocated for zero filled pages resulting in memory savings
> which would offset the memory used for the bitmap.
>
> A similar attempt was made earlier in [3] where zswap would only track
> zero-filled pages instead of same-filled.
> This patchseries adds zero-filled pages optimization to swap
> (hence it can be used even if zswap is disabled) and removes the
> same-filled code from zswap (as only 1% of the same-filled pages are
> non-zero), simplifying code.

There is also code to handle same-filled pages in zram, should we
remove this as well? It is worth noting that the handling in zram was
initially for zero-filled pages only, but it was extended to cover
same-filled pages as well by commit 8e19d540d107 ("zram: extend zero
pages to same element pages"). Apparently in a test on Android, about
2.5% of the swapped out pages were non-zero same-filled pages.

However, the step from handling zero-filled pages to handling all
same-filled pages in zram wasn't much of a stretch. But now that
zero-filled page handling in zram is redundant with this series, I wonder
if it's still worth keeping the same-filled page handling.

Adding Minchan and Sergey here.

>
> This patchseries is based on mm-unstable.
>
>
> [1] https://lore.kernel.org/all/e4d167fe-cb1e-41d1-a144-00bfa14b7148@gmail.com/
> [2] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> [3] https://lore.kernel.org/lkml/20240325235018.2028408-1-yosryahmed@google.com/
>
> ---
> v2->v3:
> - Going back to the v1 version of the implementation (David and Shakeel)
> - convert unatomic bitmap_set/clear to atomic set/clear_bit (Johannes)
> - use clear_highpage instead of folio_page_zero_fill (Yosry)
>
> v1 -> v2:
> - instead of using a bitmap in swap, clear pte for zero pages and let
>   do_pte_missing handle this page at page fault. (Yosry and Matthew)
> - Check end of page first when checking if folio is zero filled as
>   it could lead to better performance. (Yosry)
>
> Usama Arif (2):
>   mm: store zero pages to be swapped out in a bitmap
>   mm: remove code to handle same filled pages
>
>  include/linux/swap.h |  1 +
>  mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/swapfile.c        | 21 +++++++++-
>  mm/zswap.c           | 86 ++++-------------------------------------
>  4 files changed, 119 insertions(+), 81 deletions(-)
>
> --
> 2.43.0
>
Shakeel Butt June 13, 2024, 10:41 p.m. UTC | #3
On Thu, Jun 13, 2024 at 02:50:31PM GMT, Yosry Ahmed wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >
> > Going back to the v1 implementation of the patchseries. The main reason
> > is that a correct version of v2 implementation requires another rmap
> > walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> > work (i.e. more CPU used) [1], is more complex to implement compared to v1
> > and is harder to verify correctness compared to v1, where everything is
> > handled by swap.
> >
> > ---
> > As shown in the patchseries that introduced the zswap same-filled
> > optimization [2], 10-20% of the pages stored in zswap are same-filled.
> > This is also observed across Meta's server fleet.
> > By using VM counters in swap_writepage (not included in this
> > patchseries) it was found that less than 1% of the same-filled
> > pages to be swapped out are non-zero pages.
> >
> > For conventional swap setup (without zswap), rather than reading/writing
> > these pages to flash resulting in increased I/O and flash wear, a bitmap
> > can be used to mark these pages as zero at write time, and the pages can
> > be filled at read time if the bit corresponding to the page is set.
> >
> > When using zswap with swap, this also means that a zswap_entry does not
> > need to be allocated for zero filled pages resulting in memory savings
> > which would offset the memory used for the bitmap.
> >
> > A similar attempt was made earlier in [3] where zswap would only track
> > zero-filled pages instead of same-filled.
> > This patchseries adds zero-filled pages optimization to swap
> > (hence it can be used even if zswap is disabled) and removes the
> > same-filled code from zswap (as only 1% of the same-filled pages are
> > non-zero), simplifying code.
> 
> There is also code to handle same-filled pages in zram, should we
> remove this as well? It is worth noting that the handling in zram was
> initially for zero-filled pages only, but it was extended to cover
> same-filled pages as well by commit 8e19d540d107 ("zram: extend zero
> pages to same element pages"). Apparently in a test on Android, about
> 2.5% of the swapped out pages were non-zero same-filled pages.
> 
> However, the leap from handling zero-filled pages to handling all
> same-filled pages in zram wasn't a stretch. But now that zero-filled
> pages handling in zram is redundant with this series, I wonder if it's
> still worth keeping the same-filled pages handling.

Please correct me if I am wrong, but zram's same-filled page handling is
not limited to the swap-on-zram use case; any user of zram as a block
device can benefit from it. Also, zram might not see a simplification
similar to zswap's from this patch series. I would say the motivation
behind the zswap changes is quite different from that of possible zram
changes. I would recommend evaluating these cases independently.
Yosry Ahmed June 13, 2024, 10:59 p.m. UTC | #4
On Thu, Jun 13, 2024 at 3:41 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Jun 13, 2024 at 02:50:31PM GMT, Yosry Ahmed wrote:
> > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > >
> > > Going back to the v1 implementation of the patchseries. The main reason
> > > is that a correct version of v2 implementation requires another rmap
> > > walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> > > work (i.e. more CPU used) [1], is more complex to implement compared to v1
> > > and is harder to verify correctness compared to v1, where everything is
> > > handled by swap.
> > >
> > > ---
> > > As shown in the patchseries that introduced the zswap same-filled
> > > optimization [2], 10-20% of the pages stored in zswap are same-filled.
> > > This is also observed across Meta's server fleet.
> > > By using VM counters in swap_writepage (not included in this
> > > patchseries) it was found that less than 1% of the same-filled
> > > pages to be swapped out are non-zero pages.
> > >
> > > For conventional swap setup (without zswap), rather than reading/writing
> > > these pages to flash resulting in increased I/O and flash wear, a bitmap
> > > can be used to mark these pages as zero at write time, and the pages can
> > > be filled at read time if the bit corresponding to the page is set.
> > >
> > > When using zswap with swap, this also means that a zswap_entry does not
> > > need to be allocated for zero filled pages resulting in memory savings
> > > which would offset the memory used for the bitmap.
> > >
> > > A similar attempt was made earlier in [3] where zswap would only track
> > > zero-filled pages instead of same-filled.
> > > This patchseries adds zero-filled pages optimization to swap
> > > (hence it can be used even if zswap is disabled) and removes the
> > > same-filled code from zswap (as only 1% of the same-filled pages are
> > > non-zero), simplifying code.
> >
> > There is also code to handle same-filled pages in zram, should we
> > remove this as well? It is worth noting that the handling in zram was
> > initially for zero-filled pages only, but it was extended to cover
> > same-filled pages as well by commit 8e19d540d107 ("zram: extend zero
> > pages to same element pages"). Apparently in a test on Android, about
> > 2.5% of the swapped out pages were non-zero same-filled pages.
> >
> > However, the leap from handling zero-filled pages to handling all
> > same-filled pages in zram wasn't a stretch. But now that zero-filled
> > pages handling in zram is redundant with this series, I wonder if it's
> > still worth keeping the same-filled pages handling.
>
> Please correct me if I am wrong but zram same-filled page handling is
> not just limited to swap-on-zram use-case and any zram as block device
> user can benefit from it. Also zram might not see any simplification
> similar to zswap in this patch series. I would say motivation behind
> zswap changes seems quite different from possible zram changes. I would
> recommed to evaluate these cases independently.

Uh yes, I keep forgetting that zram is used for use cases other than
swap. Please dismiss my comments then (unless it's uncommon to have
zero-filled / same-filled pages in those other use cases).
Usama Arif June 14, 2024, 9:22 a.m. UTC | #5
On 13/06/2024 22:21, Yosry Ahmed wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
>> Going back to the v1 implementation of the patchseries. The main reason
>> is that a correct version of v2 implementation requires another rmap
>> walk in shrink_folio_list to change the ptes from swap entry to zero pages to
>> work (i.e. more CPU used) [1], is more complex to implement compared to v1
>> and is harder to verify correctness compared to v1, where everything is
>> handled by swap.
>>
>> ---
>> As shown in the patchseries that introduced the zswap same-filled
>> optimization [2], 10-20% of the pages stored in zswap are same-filled.
>> This is also observed across Meta's server fleet.
>> By using VM counters in swap_writepage (not included in this
>> patchseries) it was found that less than 1% of the same-filled
>> pages to be swapped out are non-zero pages.
>>
>> For conventional swap setup (without zswap), rather than reading/writing
>> these pages to flash resulting in increased I/O and flash wear, a bitmap
>> can be used to mark these pages as zero at write time, and the pages can
>> be filled at read time if the bit corresponding to the page is set.
>>
>> When using zswap with swap, this also means that a zswap_entry does not
>> need to be allocated for zero filled pages resulting in memory savings
>> which would offset the memory used for the bitmap.
>>
>> A similar attempt was made earlier in [3] where zswap would only track
>> zero-filled pages instead of same-filled.
>> This patchseries adds zero-filled pages optimization to swap
>> (hence it can be used even if zswap is disabled) and removes the
>> same-filled code from zswap (as only 1% of the same-filled pages are
>> non-zero), simplifying code.
>>
>> This patchseries is based on mm-unstable.
> Aside from saving swap/zswap space and simplifying the zswap code
> (thanks for that!), did you observe any performance benefits from not
> having to go into zswap code for zero-filled pages?
>
> In [3], I observed ~1.5% improvement in kernbench just by optimizing
> zswap's handling of zero-filled pages, and that benchmark only
> produced around 1.5% zero-filled pages. I imagine avoiding the zswap
> code entirely, and for workloads that have 10-20% zero-filled pages,
> the performance improvement should be more pronounced.
>
> When zswap is not being used and all swap activity translates to IO, I
> imagine the benefits will be much more significant.
>
> I am curious if you have any numbers with or without zswap :)

Apart from tracking zero-filled pages (using inaccurate counters not in
this series), which showed the same pattern as zswap_same_filled_pages,
NVMe writes went down by around 5-10% during stable points in the
production experiment. Performance improved by 2-3% at some points,
but this is comparing 2 sets of machines running production workloads
(which can vary between machine sets), so I would take those numbers
cautiously, which is why I didn't include them in the cover letter.
Yosry Ahmed June 14, 2024, 9:28 a.m. UTC | #6
On Fri, Jun 14, 2024 at 2:22 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 13/06/2024 22:21, Yosry Ahmed wrote:
> > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >> Going back to the v1 implementation of the patchseries. The main reason
> >> is that a correct version of v2 implementation requires another rmap
> >> walk in shrink_folio_list to change the ptes from swap entry to zero pages to
> >> work (i.e. more CPU used) [1], is more complex to implement compared to v1
> >> and is harder to verify correctness compared to v1, where everything is
> >> handled by swap.
> >>
> >> ---
> >> As shown in the patchseries that introduced the zswap same-filled
> >> optimization [2], 10-20% of the pages stored in zswap are same-filled.
> >> This is also observed across Meta's server fleet.
> >> By using VM counters in swap_writepage (not included in this
> >> patchseries) it was found that less than 1% of the same-filled
> >> pages to be swapped out are non-zero pages.
> >>
> >> For conventional swap setup (without zswap), rather than reading/writing
> >> these pages to flash resulting in increased I/O and flash wear, a bitmap
> >> can be used to mark these pages as zero at write time, and the pages can
> >> be filled at read time if the bit corresponding to the page is set.
> >>
> >> When using zswap with swap, this also means that a zswap_entry does not
> >> need to be allocated for zero filled pages resulting in memory savings
> >> which would offset the memory used for the bitmap.
> >>
> >> A similar attempt was made earlier in [3] where zswap would only track
> >> zero-filled pages instead of same-filled.
> >> This patchseries adds zero-filled pages optimization to swap
> >> (hence it can be used even if zswap is disabled) and removes the
> >> same-filled code from zswap (as only 1% of the same-filled pages are
> >> non-zero), simplifying code.
> >>
> >> This patchseries is based on mm-unstable.
> > Aside from saving swap/zswap space and simplifying the zswap code
> > (thanks for that!), did you observe any performance benefits from not
> > having to go into zswap code for zero-filled pages?
> >
> > In [3], I observed ~1.5% improvement in kernbench just by optimizing
> > zswap's handling of zero-filled pages, and that benchmark only
> > produced around 1.5% zero-filled pages. I imagine avoiding the zswap
> > code entirely, and for workloads that have 10-20% zero-filled pages,
> > the performance improvement should be more pronounced.
> >
> > When zswap is not being used and all swap activity translates to IO, I
> > imagine the benefits will be much more significant.
> >
> > I am curious if you have any numbers with or without zswap :)
>
> Apart from tracking zero-filled pages (using inaccurate counters not in
> this series) which had the same pattern to zswap_same_filled_pages, the
> nvme writes went down around 5-10% during stable points in the
> production experiment. The performance improved by 2-3% at some points,
> but this is comparing 2 sets of machines running production workloads
> (which can vary between machine sets), so I would take those numbers
> cautiously and which is why I didnt include them in the cover letter.
>

Yeah, this makes sense, thanks. It would have been great if we had
comparable numbers with and without this series, but this shouldn't be
a big deal; the advantage of the series should be self-explanatory.
It's just a shame you don't get to brag about it :)