diff mbox series

[v3,1/2] mm: store zero pages to be swapped out in a bitmap

Message ID 20240610121820.328876-2-usamaarif642@gmail.com (mailing list archive)
State New
Headers show
Series mm: store zero pages to be swapped out in a bitmap | expand

Commit Message

Usama Arif June 10, 2024, 12:15 p.m. UTC
Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, a bitmap can be used to mark these
pages as zero at write time, and the pages can be filled at
read time if the bit corresponding to the page is set.
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).

[1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/swap.h |  1 +
 mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
 mm/swapfile.c        | 21 +++++++++-
 3 files changed, 111 insertions(+), 3 deletions(-)

Comments

Matthew Wilcox June 10, 2024, 1:07 p.m. UTC | #1
On Mon, Jun 10, 2024 at 01:15:59PM +0100, Usama Arif wrote:
> +	if (is_folio_zero_filled(folio)) {
> +		swap_zeromap_folio_set(folio);
> +		folio_start_writeback(folio);
> +		folio_unlock(folio);
> +		folio_end_writeback(folio);

What's the point?  As far as I can see, the only thing this is going to
do is spend a lot of time messing with various counters only to end up
with incrementing NR_WRITTEN, which is wrong because you didn't actually
write it.
Usama Arif June 10, 2024, 1:56 p.m. UTC | #2
On 10/06/2024 14:07, Matthew Wilcox wrote:
> On Mon, Jun 10, 2024 at 01:15:59PM +0100, Usama Arif wrote:
>> +	if (is_folio_zero_filled(folio)) {
>> +		swap_zeromap_folio_set(folio);
>> +		folio_start_writeback(folio);
>> +		folio_unlock(folio);
>> +		folio_end_writeback(folio);
> What's the point?  As far as I can see, the only thing this is going to
> do is spend a lot of time messing with various counters only to end up
> with incrementing NR_WRITTEN, which is wrong because you didn't actually
> write it.
>

I am guessing what you are suggesting is just do this?

     if (is_folio_zero_filled(folio)) {
         swap_zeromap_folio_set(folio);
         folio_unlock(folio);
         return 0;
     }

This is what I did initially while developing this, but when I started 
looking into why zswap_store does  folio_start_writeback, folio_unlock, 
folio_end_writeback I found:

https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L336

"If no I/O is submitted, the filesystem must run end_page_writeback() 
against the page before returning from writepage."

and

https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L349

"Note, failure to run either redirty_page_for_writepage() or the 
combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty 
in the
radix tree.  This incoherency can lead to all sorts of hard-to-debug 
problems
in the filesystem like having dirty inodes at umount and losing written 
data.
"

If we have zswap enabled, the zero-filled pages (in fact any page that is 
compressed) will be saved in a zswap_entry and NR_WRITTEN will be wrongly 
incremented. So the behaviour for NR_WRITTEN does not change in this 
patch when encountering zero pages with zswap enabled (even if it's wrong).

This patch just extracts the optimization out from zswap [1] to swap, so 
that it always runs.

In order to fix NR_WRITTEN accounting for zswap, this patch series and 
any other cases where no I/O is submitted but end_page_writeback is 
called before returning from writepage, maybe we could add an argument to 
__folio_end_writeback like below? There are a lot of calls to 
folio_end_writeback, and the behaviour of the zero page with regards to 
NR_WRITTEN doesn't change with or without this patch series when zswap is 
enabled, so maybe we could keep this independent of this series? I would 
be happy to submit this as a separate patch if it makes sense.


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81b2e4128d26..415037f511c2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -3042,7 +3042,7 @@ static void wb_inode_writeback_end(struct bdi_writeback *wb)
         spin_unlock_irqrestore(&wb->work_lock, flags);
  }

-bool __folio_end_writeback(struct folio *folio)
+bool __folio_end_writeback(struct folio *folio, bool nr_written_increment)
  {
         long nr = folio_nr_pages(folio);
         struct address_space *mapping = folio_mapping(folio);
@@ -3078,7 +3078,8 @@ bool __folio_end_writeback(struct folio *folio)

         lruvec_stat_mod_folio(folio, NR_WRITEBACK, -nr);
         zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
-       node_stat_mod_folio(folio, NR_WRITTEN, nr);
+       if (nr_written_increment)
+               node_stat_mod_folio(folio, NR_WRITTEN, nr);
         folio_memcg_unlock(folio);

         return ret;


[1] 
https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
Matthew Wilcox June 10, 2024, 2:06 p.m. UTC | #3
On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote:
> I am guessing what you are suggesting is just do this?
> 
>     if (is_folio_zero_filled(folio)) {
>         swap_zeromap_folio_set(folio);
>         folio_unlock(folio);
>         return 0;
>     }

Right.

> This is what I did initially while developing this, but when I started
> looking into why zswap_store does  folio_start_writeback, folio_unlock,
> folio_end_writeback I found:
> 
> https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L336
> 
> "If no I/O is submitted, the filesystem must run end_page_writeback()
> against the page before returning from writepage."

But that's advice to filesystem authors.  File pages don't come this
way; we only put anonymous pages into swap (except tmpfs).

> If we have zswap enabled, the zero-filled pages (in fact any page that is
> compressed) will be saved in a zswap_entry and NR_WRITTEN will be wrongly
> incremented. So the behaviour for NR_WRITTEN does not change in this patch
> when encountering zero pages with zswap enabled (even if it's wrong).

We should fix zswap too.

> In order to fix NR_WRITTEN accounting for zswap, this patch series and any
> other cases where no I/O is submitted but end_page_writeback is called
> before returning from writepage, maybe we could add an argument to
> __folio_end_writeback like below? There are a lot of calls to
> folio_end_writeback and the behaviour of the zero page with regards to NR_WRITTEN
> doesn't change with or without this patch series with zswap enabled, so maybe
> we could keep this independent of this series? I would be happy to submit
> this as a separate patch if it makes sense.

It makes no sense at all.
Usama Arif June 10, 2024, 2:14 p.m. UTC | #4
On 10/06/2024 15:06, Matthew Wilcox wrote:
> On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote:
>> I am guessing what you are suggesting is just do this?
>>
>>      if (is_folio_zero_filled(folio)) {
>>          swap_zeromap_folio_set(folio);
>>          folio_unlock(folio);
>>          return 0;
>>      }
> Right.

Thanks! Will change to this in the next revision.

>> If we have zswap enabled, the zero-filled pages (in fact any page that is
>> compressed) will be saved in a zswap_entry and NR_WRITTEN will be wrongly
>> incremented. So the behaviour for NR_WRITTEN does not change in this patch
>> when encountering zero pages with zswap enabled (even if it's wrong).
> We should fix zswap too.
>
Will send the below diff as a separate patch for zswap:

diff --git a/mm/page_io.c b/mm/page_io.c
index 2cac1e11fb85..82796b9f08c7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -281,9 +281,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
         }
         swap_zeromap_folio_clear(folio);
         if (zswap_store(folio)) {
-                folio_start_writeback(folio);
                 folio_unlock(folio);
-                folio_end_writeback(folio);
                 return 0;
         }
Usama Arif June 10, 2024, 2:33 p.m. UTC | #5
On 10/06/2024 15:14, Usama Arif wrote:
>
> On 10/06/2024 15:06, Matthew Wilcox wrote:
>> On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote:
>>> I am guessing what you are suggesting is just do this?
>>>
>>>      if (is_folio_zero_filled(folio)) {
>>>          swap_zeromap_folio_set(folio);
>>>          folio_unlock(folio);
>>>          return 0;
>>>      }
>> Right.
>
> Thanks! Will change to this in the next revision.
>
>>> If we have zswap enabled, the zero-filled pages (in fact any page that is
>>> compressed) will be saved in a zswap_entry and NR_WRITTEN will be wrongly
>>> incremented. So the behaviour for NR_WRITTEN does not change in this patch
>>> when encountering zero pages with zswap enabled (even if it's wrong).
>> We should fix zswap too.
>>
> Will send the below diff as a separate patch for zswap:
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 2cac1e11fb85..82796b9f08c7 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -281,9 +281,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>          }
>          swap_zeromap_folio_clear(folio);
>          if (zswap_store(folio)) {
> -                folio_start_writeback(folio);
>                  folio_unlock(folio);
> -                folio_end_writeback(folio);
>                  return 0;
>          }
>

My mail client seems to have messed up the diff, but I have sent the patch 
here 
(https://lore.kernel.org/all/20240610143037.812955-1-usamaarif642@gmail.com/). 
Tested with the test_zswap kselftest.
Yosry Ahmed June 10, 2024, 5:57 p.m. UTC | #6
On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
>
> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/swap.h |  1 +
>  mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/swapfile.c        | 21 +++++++++-
>  3 files changed, 111 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a11c75e897ec..e88563978441 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -299,6 +299,7 @@ struct swap_info_struct {
>         signed char     type;           /* strange name for an index */
>         unsigned int    max;            /* extent of the swap_map */
>         unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
> +       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
>         struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>         struct swap_cluster_list free_clusters; /* free clusters list */
>         unsigned int lowest_bit;        /* index of first free in swap_map */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a360857cf75d..2cac1e11fb85 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>         goto out;
>  }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> +       unsigned long *data;
> +       unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
> +       bool ret = false;
> +
> +       data = kmap_local_folio(folio, i * PAGE_SIZE);
> +       if (data[last_pos])
> +               goto out;
> +       for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
> +               if (data[pos])
> +                       goto out;
> +       }
> +       ret = true;
> +out:
> +       kunmap_local(data);
> +       return ret;
> +}
> +
> +static bool is_folio_zero_filled(struct folio *folio)
> +{
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> +               if (!is_folio_page_zero_filled(folio, i))
> +                       return false;
> +       }
> +       return true;
> +}

Is there any benefit to iterating on the folio in pages (i.e. have
both is_folio_zero_filled() and is_folio_page_zero_filled())? Why
don't we just kmap the entire folio and check it all at once?

> +
> +static void folio_zero_fill(struct folio *folio)
> +{
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++)
> +               clear_highpage(folio_page(folio, i));
> +}
> +
> +static void swap_zeromap_folio_set(struct folio *folio)
> +{
> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       swp_entry_t entry;
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> +               entry = page_swap_entry(folio_page(folio, i));
> +               set_bit(swp_offset(entry), sis->zeromap);
> +       }
> +}
> +
> +static void swap_zeromap_folio_clear(struct folio *folio)
> +{
> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       swp_entry_t entry;
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> +               entry = page_swap_entry(folio_page(folio, i));
> +               clear_bit(swp_offset(entry), sis->zeromap);
> +       }
> +}
> +
> +static bool swap_zeromap_folio_test(struct folio *folio)
> +{
> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       swp_entry_t entry;
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> +               entry = page_swap_entry(folio_page(folio, i));
> +               if (!test_bit(swp_offset(entry), sis->zeromap))
> +                       return false;
> +       }
> +       return true;
> +}
> +
>  /*
>   * We may have stale swap cache pages in memory: notice
>   * them here and get rid of the unnecessary final write.
> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                 folio_unlock(folio);
>                 return ret;
>         }
> +
> +       if (is_folio_zero_filled(folio)) {
> +               swap_zeromap_folio_set(folio);
> +               folio_start_writeback(folio);
> +               folio_unlock(folio);
> +               folio_end_writeback(folio);
> +               return 0;
> +       }
> +       swap_zeromap_folio_clear(folio);
>         if (zswap_store(folio)) {
>                 folio_start_writeback(folio);
>                 folio_unlock(folio);
> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>                 psi_memstall_enter(&pflags);
>         }
>         delayacct_swapin_start();
> -
> -       if (zswap_load(folio)) {
> +       if (swap_zeromap_folio_test(folio)) {
> +               folio_zero_fill(folio);
> +               folio_mark_uptodate(folio);
> +               folio_unlock(folio);

We don't currently support swapping in large folios, but it is a work
in progress, and this will break once we have it.
swap_zeromap_folio_test() will return false even if parts of the folio
are in fact zero-filled. Then, we will go read those from disk swap,
essentially corrupting data.

The same problem can happen for zswap, if a large folio being swapped
is only partially in zswap. In both cases, it's really easy to miss
the problem if we're testing with zswap disabled, with incompressible
data, or with non-zero data. Silent data corruption is not very
debuggable.

I proposed adding handling for this case in zswap here:
https://lore.kernel.org/lkml/20240608023654.3513385-1-yosryahmed@google.com/

The discussions there hadn't settled yet, but depending on how it pans
out I suspect we will want something similar for the zeromap case as
well.

> +       } else if (zswap_load(folio)) {
>                 folio_mark_uptodate(folio);
>                 folio_unlock(folio);
>         } else if (data_race(sis->flags & SWP_FS_OPS)) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f1e559e216bd..90451174fe34 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>                 unsigned int idx)
>  {
> +       unsigned int i;
> +
>         /*
>          * If scan_swap_map_slots() can't find a free cluster, it will check
>          * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>          */
>         memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>                         SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> +       /*
> +        * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
> +        * call on other slots, hence use atomic clear_bit for zeromap instead of the
> +        * non-atomic bitmap_clear.
> +        */
> +       for (i = 0; i < SWAPFILE_CLUSTER; i++)
> +               clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>
>         cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>
> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>  static void swap_do_scheduled_discard(struct swap_info_struct *si)
>  {
>         struct swap_cluster_info *info, *ci;
> -       unsigned int idx;
> +       unsigned int idx, i;
>
>         info = si->cluster_info;
>
> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>                 __free_cluster(si, idx);
>                 memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>                                 0, SWAPFILE_CLUSTER);
> +               for (i = 0; i < SWAPFILE_CLUSTER; i++)
> +                       clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>                 unlock_cluster(ci);
>         }
>  }
> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>         count = p->swap_map[offset];
>         VM_BUG_ON(count != SWAP_HAS_CACHE);
>         p->swap_map[offset] = 0;
> +       clear_bit(offset, p->zeromap);
>         dec_cluster_info_page(p, p->cluster_info, offset);
>         unlock_cluster(ci);
>
> @@ -2597,6 +2609,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         free_percpu(p->cluster_next_cpu);
>         p->cluster_next_cpu = NULL;
>         vfree(swap_map);
> +       bitmap_free(p->zeromap);
>         kvfree(cluster_info);
>         /* Destroy swap account information */
>         swap_cgroup_swapoff(p->type);
> @@ -3123,6 +3136,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>                 goto bad_swap_unlock_inode;
>         }
>
> +       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> +       if (!p->zeromap) {
> +               error = -ENOMEM;
> +               goto bad_swap_unlock_inode;
> +       }
> +
>         if (p->bdev && bdev_stable_writes(p->bdev))
>                 p->flags |= SWP_STABLE_WRITES;
>
> --
> 2.43.0
>
Usama Arif June 10, 2024, 6:36 p.m. UTC | #7
On 10/06/2024 18:57, Yosry Ahmed wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>> Rather than reading/writing these pages to flash resulting
>> in increased I/O and flash wear, a bitmap can be used to mark these
>> pages as zero at write time, and the pages can be filled at
>> read time if the bit corresponding to the page is set.
>> With this patch, NVMe writes in Meta server fleet decreased
>> by almost 10% with conventional swap setup (zswap disabled).
>>
>> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>   include/linux/swap.h |  1 +
>>   mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
>>   mm/swapfile.c        | 21 +++++++++-
>>   3 files changed, 111 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a11c75e897ec..e88563978441 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -299,6 +299,7 @@ struct swap_info_struct {
>>          signed char     type;           /* strange name for an index */
>>          unsigned int    max;            /* extent of the swap_map */
>>          unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
>> +       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
>>          struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>          struct swap_cluster_list free_clusters; /* free clusters list */
>>          unsigned int lowest_bit;        /* index of first free in swap_map */
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index a360857cf75d..2cac1e11fb85 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>>          goto out;
>>   }
>>
>> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
>> +{
>> +       unsigned long *data;
>> +       unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
>> +       bool ret = false;
>> +
>> +       data = kmap_local_folio(folio, i * PAGE_SIZE);
>> +       if (data[last_pos])
>> +               goto out;
>> +       for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
>> +               if (data[pos])
>> +                       goto out;
>> +       }
>> +       ret = true;
>> +out:
>> +       kunmap_local(data);
>> +       return ret;
>> +}
>> +
>> +static bool is_folio_zero_filled(struct folio *folio)
>> +{
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++) {
>> +               if (!is_folio_page_zero_filled(folio, i))
>> +                       return false;
>> +       }
>> +       return true;
>> +}
> Is there any benefit to iterating on the folio in pages (i.e. have
> both is_folio_zero_filled() and is_folio_page_zero_filled())? Why
> don't we just kmap the entire folio and check it all at once?

Is there an API to kmap an entire folio?

I could just do data = page_address(&folio->page) in the above and iterate 
through folio_nr_pages(folio) * PAGE_SIZE, and do it all in one 
function, but the current split just looks much nicer and much more readable?
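
For illustration, the single-function alternative described above might look 
roughly like the untested sketch below. It relies on page_address(), so it 
assumes the folio memory is directly mapped (no highmem), which is presumably 
why the per-page kmap_local_folio() loop is used in the patch; the helper 
name is made up:

/*
 * Untested sketch of the flat variant: treat the whole folio as one
 * array of words and scan it. Only valid when the folio is not in
 * highmem, i.e. page_address() returns a usable mapping.
 */
static bool is_folio_zero_filled_flat(struct folio *folio)
{
	unsigned long *data = page_address(&folio->page);
	unsigned long i, nr_words = folio_nr_pages(folio) * PAGE_SIZE / sizeof(*data);

	for (i = 0; i < nr_words; i++) {
		if (data[i])
			return false;
	}
	return true;
}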

In other places in the kernel, the folio iteration is through pages as well:

https://elixir.bootlin.com/linux/latest/source/arch/csky/abiv2/cacheflush.c#L29

https://elixir.bootlin.com/linux/latest/source/arch/mips/mm/cache.c#L162

and others as well.


>> +
>> +static void folio_zero_fill(struct folio *folio)
>> +{
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++)
>> +               clear_highpage(folio_page(folio, i));
>> +}
>> +
>> +static void swap_zeromap_folio_set(struct folio *folio)
>> +{
>> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +       swp_entry_t entry;
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++) {
>> +               entry = page_swap_entry(folio_page(folio, i));
>> +               set_bit(swp_offset(entry), sis->zeromap);
>> +       }
>> +}
>> +
>> +static void swap_zeromap_folio_clear(struct folio *folio)
>> +{
>> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +       swp_entry_t entry;
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++) {
>> +               entry = page_swap_entry(folio_page(folio, i));
>> +               clear_bit(swp_offset(entry), sis->zeromap);
>> +       }
>> +}
>> +
>> +static bool swap_zeromap_folio_test(struct folio *folio)
>> +{
>> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +       swp_entry_t entry;
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++) {
>> +               entry = page_swap_entry(folio_page(folio, i));
>> +               if (!test_bit(swp_offset(entry), sis->zeromap))
>> +                       return false;
>> +       }
>> +       return true;
>> +}
>> +
>>   /*
>>    * We may have stale swap cache pages in memory: notice
>>    * them here and get rid of the unnecessary final write.
>> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>>                  folio_unlock(folio);
>>                  return ret;
>>          }
>> +
>> +       if (is_folio_zero_filled(folio)) {
>> +               swap_zeromap_folio_set(folio);
>> +               folio_start_writeback(folio);
>> +               folio_unlock(folio);
>> +               folio_end_writeback(folio);
>> +               return 0;
>> +       }
>> +       swap_zeromap_folio_clear(folio);
>>          if (zswap_store(folio)) {
>>                  folio_start_writeback(folio);
>>                  folio_unlock(folio);
>> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>>                  psi_memstall_enter(&pflags);
>>          }
>>          delayacct_swapin_start();
>> -
>> -       if (zswap_load(folio)) {
>> +       if (swap_zeromap_folio_test(folio)) {
>> +               folio_zero_fill(folio);
>> +               folio_mark_uptodate(folio);
>> +               folio_unlock(folio);
> We don't currently support swapping in large folios, but it is a work
> in progress, and this will break once we have it.
> swap_zeromap_folio_test() will return false even if parts of the folio
> are in fact zero-filled. Then, we will go read those from disk swap,
> essentially corrupting data.

So yes, with this patch I tested swap out of large zero folio, but when 
swapping in it was page by page. My assumption was that swap_read_folio 
(when support is added) would only pass a large folio that was earlier 
swapped out as a large folio. So if a zero filled large folio was 
swapped out, the zeromap will be set for all the pages in that folio, 
and at large folio swap in (when it is supported), it will see that all 
the bits in the zeromap for that folio are set,  and will just 
folio_zero_fill.

If even a single page in large folio has non-zero data then zeromap will 
not store it and it will go to either zswap or disk, and at read time as 
all the bits in zeromap are not set, it will still goto either zswap or 
disk, so I think this works?

Is my assumption wrong that large folios can be swapped in only if 
they were swapped out as large? I think this code works in that case.

>
> The same problem can happen for zswap, if a large folio being swapped
> is only partially in zswap. In both cases, it's really easy to miss
> the problem if we're testing with zswap disabled, with incompressible
> data, or with non-zero data. Silent data corruption is not very
> debuggable.
>
> I proposed adding handling for this case in zswap here:
> https://lore.kernel.org/lkml/20240608023654.3513385-1-yosryahmed@google.com/
>
> The discussions there hadn't settled yet, but depending on how it pans
> out I suspect we will want something similar for the zeromap case as
> well.
>
>> +       } else if (zswap_load(folio)) {
>>                  folio_mark_uptodate(folio);
>>                  folio_unlock(folio);
>>          } else if (data_race(sis->flags & SWP_FS_OPS)) {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index f1e559e216bd..90451174fe34 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>>   static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>                  unsigned int idx)
>>   {
>> +       unsigned int i;
>> +
>>          /*
>>           * If scan_swap_map_slots() can't find a free cluster, it will check
>>           * si->swap_map directly. To make sure the discarding cluster isn't
>> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>           */
>>          memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>                          SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> +       /*
>> +        * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
>> +        * call on other slots, hence use atomic clear_bit for zeromap instead of the
>> +        * non-atomic bitmap_clear.
>> +        */
>> +       for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> +               clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>>
>>          cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>>
>> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>>   static void swap_do_scheduled_discard(struct swap_info_struct *si)
>>   {
>>          struct swap_cluster_info *info, *ci;
>> -       unsigned int idx;
>> +       unsigned int idx, i;
>>
>>          info = si->cluster_info;
>>
>> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>>                  __free_cluster(si, idx);
>>                  memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>                                  0, SWAPFILE_CLUSTER);
>> +               for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> +                       clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>>                  unlock_cluster(ci);
>>          }
>>   }
>> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>>          count = p->swap_map[offset];
>>          VM_BUG_ON(count != SWAP_HAS_CACHE);
>>          p->swap_map[offset] = 0;
>> +       clear_bit(offset, p->zeromap);
>>          dec_cluster_info_page(p, p->cluster_info, offset);
>>          unlock_cluster(ci);
>>
>> @@ -2597,6 +2609,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>          free_percpu(p->cluster_next_cpu);
>>          p->cluster_next_cpu = NULL;
>>          vfree(swap_map);
>> +       bitmap_free(p->zeromap);
>>          kvfree(cluster_info);
>>          /* Destroy swap account information */
>>          swap_cgroup_swapoff(p->type);
>> @@ -3123,6 +3136,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>                  goto bad_swap_unlock_inode;
>>          }
>>
>> +       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>> +       if (!p->zeromap) {
>> +               error = -ENOMEM;
>> +               goto bad_swap_unlock_inode;
>> +       }
>> +
>>          if (p->bdev && bdev_stable_writes(p->bdev))
>>                  p->flags |= SWP_STABLE_WRITES;
>>
>> --
>> 2.43.0
>>
Yosry Ahmed June 10, 2024, 6:47 p.m. UTC | #8
On Mon, Jun 10, 2024 at 11:36 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 10/06/2024 18:57, Yosry Ahmed wrote:
> > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >> Approximately 10-20% of pages to be swapped out are zero pages [1].
> >> Rather than reading/writing these pages to flash resulting
> >> in increased I/O and flash wear, a bitmap can be used to mark these
> >> pages as zero at write time, and the pages can be filled at
> >> read time if the bit corresponding to the page is set.
> >> With this patch, NVMe writes in Meta server fleet decreased
> >> by almost 10% with conventional swap setup (zswap disabled).
> >>
> >> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> ---
> >>   include/linux/swap.h |  1 +
> >>   mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
> >>   mm/swapfile.c        | 21 +++++++++-
> >>   3 files changed, 111 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> index a11c75e897ec..e88563978441 100644
> >> --- a/include/linux/swap.h
> >> +++ b/include/linux/swap.h
> >> @@ -299,6 +299,7 @@ struct swap_info_struct {
> >>          signed char     type;           /* strange name for an index */
> >>          unsigned int    max;            /* extent of the swap_map */
> >>          unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
> >> +       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
> >>          struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >>          struct swap_cluster_list free_clusters; /* free clusters list */
> >>          unsigned int lowest_bit;        /* index of first free in swap_map */
> >> diff --git a/mm/page_io.c b/mm/page_io.c
> >> index a360857cf75d..2cac1e11fb85 100644
> >> --- a/mm/page_io.c
> >> +++ b/mm/page_io.c
> >> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
> >>          goto out;
> >>   }
> >>
> >> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> >> +{
> >> +       unsigned long *data;
> >> +       unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
> >> +       bool ret = false;
> >> +
> >> +       data = kmap_local_folio(folio, i * PAGE_SIZE);
> >> +       if (data[last_pos])
> >> +               goto out;
> >> +       for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
> >> +               if (data[pos])
> >> +                       goto out;
> >> +       }
> >> +       ret = true;
> >> +out:
> >> +       kunmap_local(data);
> >> +       return ret;
> >> +}
> >> +
> >> +static bool is_folio_zero_filled(struct folio *folio)
> >> +{
> >> +       unsigned int i;
> >> +
> >> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> >> +               if (!is_folio_page_zero_filled(folio, i))
> >> +                       return false;
> >> +       }
> >> +       return true;
> >> +}
> > Is there any benefit to iterating on the folio in pages (i.e. have
> > both is_folio_zero_filled() and is_folio_page_zero_filled())? Why
> > don't we just kmap the entire folio and check it all at once?
>
> Is there an API to kmap an entire folio?

I thought kmap_local_folio() takes in a 'size' parameter for some
reason, my bad.

>
> I could just do data = page_address(&folio->page) in above and iterate
> through folio_nr_pages(folio) * PAGE_SIZE, and do it all in one
> function, but this just looks much nicer and much more readable?

Yeah if we need to map each page individually, the current code is better.

>
> In other places in the kernel, the folio iteration is through pages as well:
>
> https://elixir.bootlin.com/linux/latest/source/arch/csky/abiv2/cacheflush.c#L29
>
> https://elixir.bootlin.com/linux/latest/source/arch/mips/mm/cache.c#L162
>
> and others as well.
[..]
> >> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
> >>                  psi_memstall_enter(&pflags);
> >>          }
> >>          delayacct_swapin_start();
> >> -
> >> -       if (zswap_load(folio)) {
> >> +       if (swap_zeromap_folio_test(folio)) {
> >> +               folio_zero_fill(folio);
> >> +               folio_mark_uptodate(folio);
> >> +               folio_unlock(folio);
> > We don't currently support swapping in large folios, but it is a work
> > in progress, and this will break once we have it.
> > swap_zeromap_folio_test() will return false even if parts of the folio
> > are in fact zero-filled. Then, we will go read those from disk swap,
> > essentially corrupting data.
>
> So yes, with this patch I tested swap out of large zero folio, but when
> swapping in it was page by page. My assumption was that swap_read_folio
> (when support is added) would only pass a large folio that was earlier
> swapped out as a large folio. So if a zero filled large folio was
> swapped out, the zeromap will be set for all the pages in that folio,
> and at large folio swap in (when it is supported), it will see that all
> the bits in the zeromap for that folio are set,  and will just
> folio_zero_fill.
>
> If even a single page in large folio has non-zero data then zeromap will
> not store it and it will go to either zswap or disk, and at read time as
> all the bits in zeromap are not set, it will still goto either zswap or
> disk, so I think this works?
>
> Is my assumption wrong that only large folios can be swapped in only if
> they were swapped out as large? I think this code works in that case.

I think the assumption is incorrect. I think we would just check if
contiguous PTEs have contiguous swap entries and swapin the folio as a
large folio in this case. It is likely that the swap entries are
contiguous because it was swapped out as a large folio, but I don't
think it's guaranteed.

For example, here is a patch that implements large swapin support for
the synchronous swapin case, and I think it just checks that the PTEs
have contiguous swap entries:
https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
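
For illustration only, the kind of "contiguous swap entries" check described 
here might look roughly like the untested sketch below (it is not taken from 
Barry's patch; the helper name is made up and migration/special entries are 
ignored for brevity):

static bool swap_pte_range_contiguous(pte_t *ptep, unsigned int nr)
{
	swp_entry_t first, entry;
	unsigned int i;

	/* The first PTE must already hold a swap entry. */
	if (!is_swap_pte(ptep_get(ptep)))
		return false;
	first = pte_to_swp_entry(ptep_get(ptep));

	for (i = 1; i < nr; i++) {
		pte_t pte = ptep_get(ptep + i);

		if (!is_swap_pte(pte))
			return false;
		entry = pte_to_swp_entry(pte);
		/* Same swap device and consecutive offsets. */
		if (swp_type(entry) != swp_type(first) ||
		    swp_offset(entry) != swp_offset(first) + i)
			return false;
	}
	return true;
}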

This makes a lot of sense because otherwise you'd have to keep track
of how the folios were composed at the time of swapout, to be able to
swap the same folios back in.
Usama Arif June 11, 2024, 11:49 a.m. UTC | #9
@@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>>>>                   psi_memstall_enter(&pflags);
>>>>           }
>>>>           delayacct_swapin_start();
>>>> -
>>>> -       if (zswap_load(folio)) {
>>>> +       if (swap_zeromap_folio_test(folio)) {
>>>> +               folio_zero_fill(folio);
>>>> +               folio_mark_uptodate(folio);
>>>> +               folio_unlock(folio);
>>> We don't currently support swapping in large folios, but it is a work
>>> in progress, and this will break once we have it.
>>> swap_zeromap_folio_test() will return false even if parts of the folio
>>> are in fact zero-filled. Then, we will go read those from disk swap,
>>> essentially corrupting data.
>> So yes, with this patch I tested swap out of large zero folio, but when
>> swapping in it was page by page. My assumption was that swap_read_folio
>> (when support is added) would only pass a large folio that was earlier
>> swapped out as a large folio. So if a zero filled large folio was
>> swapped out, the zeromap will be set for all the pages in that folio,
>> and at large folio swap in (when it is supported), it will see that all
>> the bits in the zeromap for that folio are set,  and will just
>> folio_zero_fill.
>>
>> If even a single page in large folio has non-zero data then zeromap will
>> not store it and it will go to either zswap or disk, and at read time as
>> all the bits in zeromap are not set, it will still goto either zswap or
>> disk, so I think this works?
>>
>> Is my assumption wrong that only large folios can be swapped in only if
>> they were swapped out as large? I think this code works in that case.
> I think the assumption is incorrect. I think we would just check if
> contiguous PTEs have contiguous swap entries and swapin the folio as a
> large folio in this case. It is likely that the swap entries are
> contiguous because it was swapped out as a large folio, but I don't
> think it's guaranteed.

Yes, makes sense. Thanks for explaining this.

>
> For example, here is a patch that implements large swapin support for
> the synchronous swapin case, and I think it just checks that the PTEs
> have contiguous swap entries:
> https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
>
> This makes a lot of sense because otherwise you'd have to keep track
> of how the folios were composed at the time of swapout, to be able to
> swap the same folios back in.

I think the solution to large folio swap-in for this optimization and 
zswap is not in swap_read_folio in this patch series or any call further 
down the stack, as it's too late by the time you reach here, but in 
Barry's patch. This needs to happen much earlier when deciding the size 
of the folio at folio creation in alloc_swap_folio, after Barry checks

     if (is_zswap_enabled())
         goto fallback;

once the order is decided, we would need to check whether the bits in the 
zeromap array starting from swap_offset(entry) and ending at 
swap_offset(entry) + 2^order are set. If no bits are set, or all bits 
are set, then we allocate a large folio. Otherwise, we go to fallback.
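
For illustration, that check could be a small helper along the lines of the 
untested sketch below (the helper name is made up; the real change would 
live in or near alloc_swap_folio):

/*
 * Untested sketch: only pick a large folio of the given order if the
 * zeromap bits for the whole swap range agree, i.e. either none are set
 * (all subpages come from zswap/disk) or all are set (all zero-filled).
 */
static bool swap_zeromap_range_uniform(struct swap_info_struct *sis,
				       pgoff_t offset, int order)
{
	unsigned int i, nr = 1 << order;
	unsigned int nr_zero = 0;

	for (i = 0; i < nr; i++)
		nr_zero += test_bit(offset + i, sis->zeromap) ? 1 : 0;

	return nr_zero == 0 || nr_zero == nr;
}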

I think it's better to handle this in Barry's patch. I feel this series is 
close to its final state, i.e. the only diff I have for the next 
revision is the one below to remove start/end_writeback for the zero-filled 
case. I will comment on Barry's patch once I send out the next revision of this.


diff --git a/mm/page_io.c b/mm/page_io.c
index 2cac1e11fb85..08a3772ddcf7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -274,9 +274,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)

         if (is_folio_zero_filled(folio)) {
                 swap_zeromap_folio_set(folio);
-               folio_start_writeback(folio);
                 folio_unlock(folio);
-               folio_end_writeback(folio);
                 return 0;
         }
         swap_zeromap_folio_clear(folio);
Yosry Ahmed June 11, 2024, 3:42 p.m. UTC | #10
On Tue, Jun 11, 2024 at 4:49 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
> >>>>                   psi_memstall_enter(&pflags);
> >>>>           }
> >>>>           delayacct_swapin_start();
> >>>> -
> >>>> -       if (zswap_load(folio)) {
> >>>> +       if (swap_zeromap_folio_test(folio)) {
> >>>> +               folio_zero_fill(folio);
> >>>> +               folio_mark_uptodate(folio);
> >>>> +               folio_unlock(folio);
> >>> We don't currently support swapping in large folios, but it is a work
> >>> in progress, and this will break once we have it.
> >>> swap_zeromap_folio_test() will return false even if parts of the folio
> >>> are in fact zero-filled. Then, we will go read those from disk swap,
> >>> essentially corrupting data.
> >> So yes, with this patch I tested swap out of large zero folio, but when
> >> swapping in it was page by page. My assumption was that swap_read_folio
> >> (when support is added) would only pass a large folio that was earlier
> >> swapped out as a large folio. So if a zero filled large folio was
> >> swapped out, the zeromap will be set for all the pages in that folio,
> >> and at large folio swap in (when it is supported), it will see that all
> >> the bits in the zeromap for that folio are set,  and will just
> >> folio_zero_fill.
> >>
> >> If even a single page in large folio has non-zero data then zeromap will
> >> not store it and it will go to either zswap or disk, and at read time as
> >> all the bits in zeromap are not set, it will still goto either zswap or
> >> disk, so I think this works?
> >>
> >> Is my assumption wrong that only large folios can be swapped in only if
> >> they were swapped out as large? I think this code works in that case.
> > I think the assumption is incorrect. I think we would just check if
> > contiguous PTEs have contiguous swap entries and swapin the folio as a
> > large folio in this case. It is likely that the swap entries are
> > contiguous because it was swapped out as a large folio, but I don't
> > think it's guaranteed.
>
> Yes, makes sense. Thanks for explaining this.
>
> >
> > For example, here is a patch that implements large swapin support for
> > the synchronous swapin case, and I think it just checks that the PTEs
> > have contiguous swap entries:
> > https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
> >
> > This makes a lot of sense because otherwise you'd have to keep track
> > of how the folios were composed at the time of swapout, to be able to
> > swap the same folios back in.
>
> I think the solution to large folio swap-in for this optimization and
> zswap is not in swap_read_folio in this patch-series or any call further
> down the stack, as its too late by the time you reach here, but in
> Barrys' patch. This needs to happen much earlier when deciding the size
> of the folio at folio creation in alloc_swap_folio, after Barry checks
>
>      if (is_zswap_enabled())
>          goto fallback;
>
> once the order is decided, we would need to check the indexes in the
> zeromap array starting from swap_offset(entry) and ending at
> swap_offset(entry) + 2^order are set. If no bits are set, or all bits
> are set, then we allocate large folio. Otherwise, we goto fallback.
>
> I think its better to handle this in Barrys patch. I feel this series is
> close to its final state, i.e. the only diff I have for the next
> revision is below to remove start/end_writeback for zer_filled case. I
> will comment on Barrys patch once the I send out the next revision of this.

Sorry I did not make myself clearer. I did not mean that you should
handle the large folio swapin here. This needs to be handled at a
higher level because as you mentioned, a large folio may be partially
in the zeromap, zswap, swapcache, disk, etc.

What I meant is that we should probably have a debug check to make
sure this doesn't go unhandled. For zswap, I am trying to add a
warning and fail the swapin operation if a large folio slips through
to zswap. We can do something similar here if folks agree this is the
right way in the interim:
https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/.

Maybe I am too paranoid, but I think it's easy to mess up these things
when working on large folio swapin imo.
Usama Arif June 11, 2024, 4:52 p.m. UTC | #11
On 11/06/2024 16:42, Yosry Ahmed wrote:
> On Tue, Jun 11, 2024 at 4:49 AM Usama Arif <usamaarif642@gmail.com> wrote:
>> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>>>>>>                    psi_memstall_enter(&pflags);
>>>>>>            }
>>>>>>            delayacct_swapin_start();
>>>>>> -
>>>>>> -       if (zswap_load(folio)) {
>>>>>> +       if (swap_zeromap_folio_test(folio)) {
>>>>>> +               folio_zero_fill(folio);
>>>>>> +               folio_mark_uptodate(folio);
>>>>>> +               folio_unlock(folio);
>>>>> We don't currently support swapping in large folios, but it is a work
>>>>> in progress, and this will break once we have it.
>>>>> swap_zeromap_folio_test() will return false even if parts of the folio
>>>>> are in fact zero-filled. Then, we will go read those from disk swap,
>>>>> essentially corrupting data.
>>>> So yes, with this patch I tested swap out of large zero folio, but when
>>>> swapping in it was page by page. My assumption was that swap_read_folio
>>>> (when support is added) would only pass a large folio that was earlier
>>>> swapped out as a large folio. So if a zero filled large folio was
>>>> swapped out, the zeromap will be set for all the pages in that folio,
>>>> and at large folio swap in (when it is supported), it will see that all
>>>> the bits in the zeromap for that folio are set,  and will just
>>>> folio_zero_fill.
>>>>
>>>> If even a single page in large folio has non-zero data then zeromap will
>>>> not store it and it will go to either zswap or disk, and at read time as
>>>> all the bits in zeromap are not set, it will still goto either zswap or
>>>> disk, so I think this works?
>>>>
>>>> Is my assumption wrong that only large folios can be swapped in only if
>>>> they were swapped out as large? I think this code works in that case.
>>> I think the assumption is incorrect. I think we would just check if
>>> contiguous PTEs have contiguous swap entries and swapin the folio as a
>>> large folio in this case. It is likely that the swap entries are
>>> contiguous because it was swapped out as a large folio, but I don't
>>> think it's guaranteed.
>> Yes, makes sense. Thanks for explaining this.
>>
>>> For example, here is a patch that implements large swapin support for
>>> the synchronous swapin case, and I think it just checks that the PTEs
>>> have contiguous swap entries:
>>> https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
>>>
>>> This makes a lot of sense because otherwise you'd have to keep track
>>> of how the folios were composed at the time of swapout, to be able to
>>> swap the same folios back in.
>> I think the solution to large folio swap-in for this optimization and
>> zswap is not in swap_read_folio in this patch-series or any call further
>> down the stack, as its too late by the time you reach here, but in
>> Barrys' patch. This needs to happen much earlier when deciding the size
>> of the folio at folio creation in alloc_swap_folio, after Barry checks
>>
>>       if (is_zswap_enabled())
>>           goto fallback;
>>
>> once the order is decided, we would need to check the indexes in the
>> zeromap array starting from swap_offset(entry) and ending at
>> swap_offset(entry) + 2^order are set. If no bits are set, or all bits
>> are set, then we allocate large folio. Otherwise, we goto fallback.
>>
>> I think its better to handle this in Barrys patch. I feel this series is
>> close to its final state, i.e. the only diff I have for the next
>> revision is below to remove start/end_writeback for zer_filled case. I
>> will comment on Barrys patch once the I send out the next revision of this.
> Sorry I did not make myself clearer. I did not mean that you should
> handle the large folio swapin here. This needs to be handled at a
> higher level because as you mentioned, a large folio may be partially
> in the zeromap, zswap, swapcache, disk, etc.
>
> What I meant is that we should probably have a debug check to make
> sure this doesn't go unhandled. For zswap, I am trying to add a
> warning and fail the swapin operation if a large folio slips through
> to zswap. We can do something similar here if folks agree this is the
> right way in the interim:
> https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/.
>
> Maybe I am too paranoid, but I think it's easy to mess up these things
> when working on large folio swapin imo.

So there is a difference between zswap and this optimization. In this 
optimization, if the zeromap is set for all the folio bits, then we 
should do large folio swapin. There still needs to be a change in Barry's 
patch in alloc_swap_folio, but apart from that, does the below diff over 
v3 make it better? I will send a v4 with this if it sounds good.


diff --git a/mm/page_io.c b/mm/page_io.c
index 6400be6e4291..bf01364748a9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio *folio)
         }
  }

-static bool swap_zeromap_folio_test(struct folio *folio)
+/*
+ * Return the index of the first subpage which is not zero-filled
+ * according to swap_info_struct->zeromap.
+ * If all pages are zero-filled according to zeromap, it will return
+ * folio_nr_pages(folio).
+ */
+static long swap_zeromap_folio_test(struct folio *folio)
  {
         struct swap_info_struct *sis = swp_swap_info(folio->swap);
         swp_entry_t entry;
-       unsigned int i;
+       long i;

         for (i = 0; i < folio_nr_pages(folio); i++) {
                 entry = page_swap_entry(folio_page(folio, i));
                 if (!test_bit(swp_offset(entry), sis->zeromap))
-                       return false;
+                       return i;
         }
-       return true;
+       return i;
  }

  /*
@@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool synchronous,
  {
         struct swap_info_struct *sis = swp_swap_info(folio->swap);
         bool workingset = folio_test_workingset(folio);
+       long first_non_zero_page_idx;
         unsigned long pflags;
         bool in_thrashing;

@@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool synchronous,
                 psi_memstall_enter(&pflags);
         }
         delayacct_swapin_start();
-       if (swap_zeromap_folio_test(folio)) {
+       first_non_zero_page_idx = swap_zeromap_folio_test(folio);
+       if (first_non_zero_page_idx == folio_nr_pages(folio)) {
                 folio_zero_fill(folio);
                 folio_mark_uptodate(folio);
                 folio_unlock(folio);
+       } else if (first_non_zero_page_idx != 0) {
+               /*
+                * The case for when only *some* of subpages being swapped-in were recorded
+                * in sis->zeromap, while the rest are in zswap/disk is currently not handled.
+                * WARN in this case and return without marking the folio uptodate so that
+                * an IO error is emitted (e.g. do_swap_page() will sigbus).
+                */
+                WARN_ON_ONCE(1);
         } else if (zswap_load(folio)) {
                 folio_mark_uptodate(folio);
                 folio_unlock(folio);
Yosry Ahmed June 11, 2024, 5:51 p.m. UTC | #12
[..]
> >> I think its better to handle this in Barrys patch. I feel this series is
> >> close to its final state, i.e. the only diff I have for the next
> >> revision is below to remove start/end_writeback for zer_filled case. I
> >> will comment on Barrys patch once the I send out the next revision of this.
> > Sorry I did not make myself clearer. I did not mean that you should
> > handle the large folio swapin here. This needs to be handled at a
> > higher level because as you mentioned, a large folio may be partially
> > in the zeromap, zswap, swapcache, disk, etc.
> >
> > What I meant is that we should probably have a debug check to make
> > sure this doesn't go unhandled. For zswap, I am trying to add a
> > warning and fail the swapin operation if a large folio slips through
> > to zswap. We can do something similar here if folks agree this is the
> > right way in the interim:
> > https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/.
> >
> > Maybe I am too paranoid, but I think it's easy to mess up these things
> > when working on large folio swapin imo.
>
> So there is a difference between zswap and this optimization. In this
> optimization, if the zeromap is set for all the folio bits, then we
> should do large folio swapin. There still needs to be a change in Barrys
> patch in alloc_swap_folio, but apart from that does the below diff over
> v3 make it better? I will send a v4 with this if it sounds good.
>
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 6400be6e4291..bf01364748a9 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio
> *folio)
>          }
>   }
>
> -static bool swap_zeromap_folio_test(struct folio *folio)
> +/*
> + * Return the index of the first subpage which is not zero-filled
> + * according to swap_info_struct->zeromap.
> + * If all pages are zero-filled according to zeromap, it will return
> + * folio_nr_pages(folio).
> + */
> +static long swap_zeromap_folio_test(struct folio *folio)
>   {
>          struct swap_info_struct *sis = swp_swap_info(folio->swap);
>          swp_entry_t entry;
> -       unsigned int i;
> +       long i;

Why long?

>
>          for (i = 0; i < folio_nr_pages(folio); i++) {
>                  entry = page_swap_entry(folio_page(folio, i));
>                  if (!test_bit(swp_offset(entry), sis->zeromap))
> -                       return false;
> +                       return i;
>          }
> -       return true;
> +       return i;
>   }
>
>   /*
> @@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool
> synchronous,
>   {
>          struct swap_info_struct *sis = swp_swap_info(folio->swap);
>          bool workingset = folio_test_workingset(folio);
> +       long first_non_zero_page_idx;
>          unsigned long pflags;
>          bool in_thrashing;
>
> @@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool
> synchronous,
>                  psi_memstall_enter(&pflags);
>          }
>          delayacct_swapin_start();
> -       if (swap_zeromap_folio_test(folio)) {
> +       first_non_zero_page_idx = swap_zeromap_folio_test(folio);
> +       if (first_non_zero_page_idx == folio_nr_pages(folio)) {
>                  folio_zero_fill(folio);
>                  folio_mark_uptodate(folio);
>                  folio_unlock(folio);
> +       } else if (first_non_zero_page_idx != 0) {
> +               /*
> +                * The case for when only *some* of subpages being
> swapped-in were recorded
> +                * in sis->zeromap, while the rest are in zswap/disk is
> currently not handled.
> +                * WARN in this case and return without marking the
> folio uptodate so that
> +                * an IO error is emitted (e.g. do_swap_page() will sigbus).
> +                */
> +                WARN_ON_ONCE(1);
>          } else if (zswap_load(folio)) {
>                  folio_mark_uptodate(folio);
>                  folio_unlock(folio);
>
>

This is too much noise for swap_read_folio(). How about adding
swap_read_folio_zeromap() that takes care of this and decides whether
or not to call folio_mark_uptodate()?

-static bool swap_zeromap_folio_test(struct folio *folio)
+/*
+ * Return the index of the first subpage which is not zero-filled according to
+ * swap_info_struct->zeromap.  If all pages are zero-filled according to
+ * zeromap, it will return folio_nr_pages(folio).
+ */
+static unsigned int swap_zeromap_folio_test(struct folio *folio)
 {
        struct swap_info_struct *sis = swp_swap_info(folio->swap);
        swp_entry_t entry;
@@ -243,9 +248,9 @@ static bool swap_zeromap_folio_test(struct folio *folio)
        for (i = 0; i < folio_nr_pages(folio); i++) {
                entry = page_swap_entry(folio_page(folio, i));
                if (!test_bit(swp_offset(entry), sis->zeromap))
-                       return false;
+                       return i;
        }
-       return true;
+       return i;
 }

 /*
@@ -511,6 +516,25 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
        mempool_free(sio, sio_pool);
 }

+static bool swap_read_folio_zeromap(struct folio *folio)
+{
+       unsigned int idx = swap_zeromap_folio_test(folio);
+
+       if (idx == 0)
+               return false;
+
+       /*
+        * Swapping in a large folio that is partially in the zeromap is not
+        * currently handled. Return true without marking the folio uptodate so
+        * that an IO error is emitted (e.g.  do_swap_page() will sigbus).
+        */
+       if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
+               return true;
+
+       folio_zero_fill(folio);
+       folio_mark_uptodate(folio);
+       return true;
+}
+
 static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
 {
        struct swap_info_struct *sis = swp_swap_info(folio->swap);
@@ -600,9 +624,7 @@ void swap_read_folio(struct folio *folio, bool synchronous,
                psi_memstall_enter(&pflags);
        }
        delayacct_swapin_start();
-       if (swap_zeromap_folio_test(folio)) {
-               folio_zero_fill(folio);
-               folio_mark_uptodate(folio);
+       if (swap_read_folio_zeromap(folio)) {
                folio_unlock(folio);
        } else if (zswap_load(folio)) {
                folio_mark_uptodate(folio);
Nhat Pham June 11, 2024, 6:39 p.m. UTC | #13
On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
>
> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/swap.h |  1 +
>  mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/swapfile.c        | 21 +++++++++-
>  3 files changed, 111 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a11c75e897ec..e88563978441 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -299,6 +299,7 @@ struct swap_info_struct {
>         signed char     type;           /* strange name for an index */
>         unsigned int    max;            /* extent of the swap_map */
>         unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
> +       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
>         struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>         struct swap_cluster_list free_clusters; /* free clusters list */
>         unsigned int lowest_bit;        /* index of first free in swap_map */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a360857cf75d..2cac1e11fb85 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>         goto out;
>  }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> +       unsigned long *data;
> +       unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
> +       bool ret = false;
> +
> +       data = kmap_local_folio(folio, i * PAGE_SIZE);
> +       if (data[last_pos])
> +               goto out;
> +       for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
> +               if (data[pos])
> +                       goto out;
> +       }
> +       ret = true;
> +out:
> +       kunmap_local(data);
> +       return ret;
> +}
> +
> +static bool is_folio_zero_filled(struct folio *folio)
> +{
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> +               if (!is_folio_page_zero_filled(folio, i))
> +                       return false;
> +       }
> +       return true;
> +}
> +
> +static void folio_zero_fill(struct folio *folio)
> +{
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++)
> +               clear_highpage(folio_page(folio, i));
> +}
> +
> +static void swap_zeromap_folio_set(struct folio *folio)
> +{
> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       swp_entry_t entry;
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> +               entry = page_swap_entry(folio_page(folio, i));
> +               set_bit(swp_offset(entry), sis->zeromap);
> +       }
> +}
> +
> +static void swap_zeromap_folio_clear(struct folio *folio)
> +{
> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       swp_entry_t entry;
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> +               entry = page_swap_entry(folio_page(folio, i));
> +               clear_bit(swp_offset(entry), sis->zeromap);
> +       }
> +}
> +
> +static bool swap_zeromap_folio_test(struct folio *folio)
> +{
> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       swp_entry_t entry;
> +       unsigned int i;
> +
> +       for (i = 0; i < folio_nr_pages(folio); i++) {
> +               entry = page_swap_entry(folio_page(folio, i));
> +               if (!test_bit(swp_offset(entry), sis->zeromap))
> +                       return false;
> +       }
> +       return true;
> +}
> +
>  /*
>   * We may have stale swap cache pages in memory: notice
>   * them here and get rid of the unnecessary final write.
> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                 folio_unlock(folio);
>                 return ret;
>         }
> +
> +       if (is_folio_zero_filled(folio)) {
> +               swap_zeromap_folio_set(folio);
> +               folio_start_writeback(folio);
> +               folio_unlock(folio);
> +               folio_end_writeback(folio);
> +               return 0;
> +       }
> +       swap_zeromap_folio_clear(folio);
>         if (zswap_store(folio)) {
>                 folio_start_writeback(folio);
>                 folio_unlock(folio);
> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>                 psi_memstall_enter(&pflags);
>         }
>         delayacct_swapin_start();
> -
> -       if (zswap_load(folio)) {
> +       if (swap_zeromap_folio_test(folio)) {
> +               folio_zero_fill(folio);
> +               folio_mark_uptodate(folio);
> +               folio_unlock(folio);
> +       } else if (zswap_load(folio)) {
>                 folio_mark_uptodate(folio);
>                 folio_unlock(folio);
>         } else if (data_race(sis->flags & SWP_FS_OPS)) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f1e559e216bd..90451174fe34 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>                 unsigned int idx)
>  {
> +       unsigned int i;
> +
>         /*
>          * If scan_swap_map_slots() can't find a free cluster, it will check
>          * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>          */
>         memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>                         SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> +       /*
> +        * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
> +        * call on other slots, hence use atomic clear_bit for zeromap instead of the
> +        * non-atomic bitmap_clear.
> +        */
> +       for (i = 0; i < SWAPFILE_CLUSTER; i++)
> +               clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>
>         cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>
> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>  static void swap_do_scheduled_discard(struct swap_info_struct *si)
>  {
>         struct swap_cluster_info *info, *ci;
> -       unsigned int idx;
> +       unsigned int idx, i;
>
>         info = si->cluster_info;
>
> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>                 __free_cluster(si, idx);
>                 memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>                                 0, SWAPFILE_CLUSTER);
> +               for (i = 0; i < SWAPFILE_CLUSTER; i++)
> +                       clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>                 unlock_cluster(ci);
>         }
>  }
> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>         count = p->swap_map[offset];
>         VM_BUG_ON(count != SWAP_HAS_CACHE);
>         p->swap_map[offset] = 0;
> +       clear_bit(offset, p->zeromap);

Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
we can have a problem, where:

1. The swap entry has its zeromap bit set, and is freed to the swap
slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
reclaimed from the swap cache, and all the processes referring to it
are terminated, which decrements the swap count to 0 (swap_free() ->
__swap_entry_free() -> free_swap_slot())

2. The swap slot is then re-used in swap space allocation
(add_to_swap()) - its zeromap bit is never cleared.

3. swap_writepage() writes that non-zero page to swap

4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
the entry is set, so populates a zero page for it.

zswap in the past had to invalidate these leftover entries quite
carefully. Chengming then moved the invalidation point to
free_swap_slot(), massively simplifying the logic.

I wonder if we need to do the same here?
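
For concreteness, something along the lines of the sketch below is what I
mean (purely illustrative and untested; the rest of free_swap_slot()'s body
is elided, and the placement is just an assumption):

    void free_swap_slot(swp_entry_t entry)
    {
        /*
         * Hypothetical early clear: drop the zeromap bit as soon as the
         * entry is handed back to the slot cache, mirroring where zswap
         * now invalidates its entries.
         */
        clear_bit(swp_offset(entry), swp_swap_info(entry)->zeromap);

        /* ... existing slot cache / swapcache_free_entries() handling ... */
    }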
Usama Arif June 11, 2024, 6:43 p.m. UTC | #14
On 11/06/2024 18:51, Yosry Ahmed wrote:
> [..]
>>>> I think it's better to handle this in Barry's patch. I feel this series is
>>>> close to its final state, i.e. the only diff I have for the next
>>>> revision is below to remove start/end_writeback for the zero-filled case. I
>>>> will comment on Barry's patch once I send out the next revision of this.
>>> Sorry I did not make myself clearer. I did not mean that you should
>>> handle the large folio swapin here. This needs to be handled at a
>>> higher level because as you mentioned, a large folio may be partially
>>> in the zeromap, zswap, swapcache, disk, etc.
>>>
>>> What I meant is that we should probably have a debug check to make
>>> sure this doesn't go unhandled. For zswap, I am trying to add a
>>> warning and fail the swapin operation if a large folio slips through
>>> to zswap. We can do something similar here if folks agree this is the
>>> right way in the interim:
>>> https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/.
>>>
>>> Maybe I am too paranoid, but I think it's easy to mess up these things
>>> when working on large folio swapin imo.
>> So there is a difference between zswap and this optimization. In this
>> optimization, if the zeromap is set for all the folio bits, then we
>> should do large folio swapin. There still needs to be a change in Barry's
>> patch in alloc_swap_folio, but apart from that does the below diff over
>> v3 make it better? I will send a v4 with this if it sounds good.
>>
>>
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index 6400be6e4291..bf01364748a9 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio *folio)
>>           }
>>    }
>>
>> -static bool swap_zeromap_folio_test(struct folio *folio)
>> +/*
>> + * Return the index of the first subpage which is not zero-filled
>> + * according to swap_info_struct->zeromap.
>> + * If all pages are zero-filled according to zeromap, it will return
>> + * folio_nr_pages(folio).
>> + */
>> +static long swap_zeromap_folio_test(struct folio *folio)
>>    {
>>           struct swap_info_struct *sis = swp_swap_info(folio->swap);
>>           swp_entry_t entry;
>> -       unsigned int i;
>> +       long i;
> Why long?


folio_nr_pages() returns long, but I just checked that
folio->_folio_nr_pages is unsigned int, which will probably just be
cast to long :). I will switch to unsigned int as it's not really
going to need long, even with CONFIG_64BIT.

>>           for (i = 0; i < folio_nr_pages(folio); i++) {
>>                   entry = page_swap_entry(folio_page(folio, i));
>>                   if (!test_bit(swp_offset(entry), sis->zeromap))
>> -                       return false;
>> +                       return i;
>>           }
>> -       return true;
>> +       return i;
>>    }
>>
>>    /*
>> @@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>>    {
>>           struct swap_info_struct *sis = swp_swap_info(folio->swap);
>>           bool workingset = folio_test_workingset(folio);
>> +       long first_non_zero_page_idx;
>>           unsigned long pflags;
>>           bool in_thrashing;
>>
>> @@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>>                   psi_memstall_enter(&pflags);
>>           }
>>           delayacct_swapin_start();
>> -       if (swap_zeromap_folio_test(folio)) {
>> +       first_non_zero_page_idx = swap_zeromap_folio_test(folio);
>> +       if (first_non_zero_page_idx == folio_nr_pages(folio)) {
>>                   folio_zero_fill(folio);
>>                   folio_mark_uptodate(folio);
>>                   folio_unlock(folio);
>> +       } else if (first_non_zero_page_idx != 0) {
>> +               /*
>> +                * The case for when only *some* of subpages being swapped-in were recorded
>> +                * in sis->zeromap, while the rest are in zswap/disk is currently not handled.
>> +                * WARN in this case and return without marking the folio uptodate so that
>> +                * an IO error is emitted (e.g. do_swap_page() will sigbus).
>> +                */
>> +                WARN_ON_ONCE(1);
>>           } else if (zswap_load(folio)) {
>>                   folio_mark_uptodate(folio);
>>                   folio_unlock(folio);
>>
>>
> This is too much noise for swap_read_folio(). How about adding
> swap_read_folio_zeromap() that takes care of this and decides whether
> or not to call folio_mark_uptodate()?

Sounds good, will do as below. Thanks!

>
> -static bool swap_zeromap_folio_test(struct folio *folio)
> +/*
> + * Return the index of the first subpage which is not zero-filled according to
> + * swap_info_struct->zeromap.  If all pages are zero-filled according to
> + * zeromap, it will return folio_nr_pages(folio).
> + */
> +static unsigned int swap_zeromap_folio_test(struct folio *folio)
>   {
>          struct swap_info_struct *sis = swp_swap_info(folio->swap);
>          swp_entry_t entry;
> @@ -243,9 +248,9 @@ static bool swap_zeromap_folio_test(struct folio *folio)
>          for (i = 0; i < folio_nr_pages(folio); i++) {
>                  entry = page_swap_entry(folio_page(folio, i));
>                  if (!test_bit(swp_offset(entry), sis->zeromap))
> -                       return false;
> +                       return i;
>          }
> -       return true;
> +       return i;
>   }
>
>   /*
> @@ -511,6 +516,25 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
>          mempool_free(sio, sio_pool);
>   }
>
> +static bool swap_read_folio_zeromap(struct folio *folio)
> +{
> +       unsigned int idx = swap_zeromap_folio_test(folio);
> +
> +       if (idx == 0)
> +               return false;
> +
> +       /*
> +        * Swapping in a large folio that is partially in the zeromap is not
> +        * currently handled. Return true without marking the folio uptodate so
> +        * that an IO error is emitted (e.g.  do_swap_page() will sigbus).
> +        */
> +       if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
> +               return true;
> +
> +       folio_zero_fill(folio);
> +       folio_mark_uptodate(folio);
> +       return true;
> +}
> +
>   static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
>   {
>          struct swap_info_struct *sis = swp_swap_info(folio->swap);
> @@ -600,9 +624,7 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>                  psi_memstall_enter(&pflags);
>          }
>          delayacct_swapin_start();
> -       if (swap_zeromap_folio_test(folio)) {
> -               folio_zero_fill(folio);
> -               folio_mark_uptodate(folio);
> +       if (swap_read_folio_zeromap(folio)) {
>                  folio_unlock(folio);
>          } else if (zswap_load(folio)) {
>                  folio_mark_uptodate(folio);
Yosry Ahmed June 11, 2024, 6:46 p.m. UTC | #15
[..]
> > @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> >         count = p->swap_map[offset];
> >         VM_BUG_ON(count != SWAP_HAS_CACHE);
> >         p->swap_map[offset] = 0;
> > +       clear_bit(offset, p->zeromap);
>
> Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
> we can have a problem, where:
>
> 1. The swap entry has its zeromap bit set, and is freed to the swap
> slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
> reclaimed from the swap cache, and all the processes referring to it
> are terminated, which decrements the swap count to 0 (swap_free() ->
> __swap_entry_free() -> free_swap_slot())
>
> 2. The swap slot is then re-used in swap space allocation
> (add_to_swap()) - its zeromap bit is never cleared.

I do not think this can happen before swap_entry_free() is called.
Note that when a swap entry is freed to the swap slot cache in
free_swap_slot(), it is added to cache->slots_ret, not cache->slots.
The former holds swap entries that are cached to be freed later using
swap_entry_free().
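
For reference, the slot cache keeps the two roles in separate arrays.
Roughly (abridged and from memory, so double-check against
include/linux/swap_slots.h):

    struct swap_slots_cache {
        struct mutex    alloc_lock;   /* protects slots, nr, cur */
        swp_entry_t     *slots;       /* entries cached for allocation */
        int             nr;
        int             cur;
        spinlock_t      free_lock;    /* protects slots_ret, n_ret */
        swp_entry_t     *slots_ret;   /* entries waiting to be freed via swap_entry_free() */
        int             n_ret;
    };

So a freed entry sits in slots_ret until the cache is drained through
swap_entry_free(); only after that can the slot be handed out again.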

>
> 3. swap_writepage() writes that non-zero page to swap
>
> 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
> the entry is set, so populates a zero page for it.
>
> zswap in the past had to invalidate these leftover entries quite
> carefully. Chengming then moved the invalidation point to
> free_swap_slot(), massively simplifying the logic.

I think the main benefit of moving the invalidation point was avoiding
leaving the compressed page in zswap until swap_entry_free() is
called, which will happen only when the swap slot caches are drained.

>
> I wonder if we need to do the same here?
Usama Arif June 11, 2024, 6:50 p.m. UTC | #16
On 11/06/2024 19:39, Nhat Pham wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote:
>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>> Rather than reading/writing these pages to flash resulting
>> in increased I/O and flash wear, a bitmap can be used to mark these
>> pages as zero at write time, and the pages can be filled at
>> read time if the bit corresponding to the page is set.
>> With this patch, NVMe writes in Meta server fleet decreased
>> by almost 10% with conventional swap setup (zswap disabled).
>>
>> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>   include/linux/swap.h |  1 +
>>   mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
>>   mm/swapfile.c        | 21 +++++++++-
>>   3 files changed, 111 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a11c75e897ec..e88563978441 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -299,6 +299,7 @@ struct swap_info_struct {
>>          signed char     type;           /* strange name for an index */
>>          unsigned int    max;            /* extent of the swap_map */
>>          unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
>> +       unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
>>          struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>          struct swap_cluster_list free_clusters; /* free clusters list */
>>          unsigned int lowest_bit;        /* index of first free in swap_map */
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index a360857cf75d..2cac1e11fb85 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>>          goto out;
>>   }
>>
>> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
>> +{
>> +       unsigned long *data;
>> +       unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
>> +       bool ret = false;
>> +
>> +       data = kmap_local_folio(folio, i * PAGE_SIZE);
>> +       if (data[last_pos])
>> +               goto out;
>> +       for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
>> +               if (data[pos])
>> +                       goto out;
>> +       }
>> +       ret = true;
>> +out:
>> +       kunmap_local(data);
>> +       return ret;
>> +}
>> +
>> +static bool is_folio_zero_filled(struct folio *folio)
>> +{
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++) {
>> +               if (!is_folio_page_zero_filled(folio, i))
>> +                       return false;
>> +       }
>> +       return true;
>> +}
>> +
>> +static void folio_zero_fill(struct folio *folio)
>> +{
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++)
>> +               clear_highpage(folio_page(folio, i));
>> +}
>> +
>> +static void swap_zeromap_folio_set(struct folio *folio)
>> +{
>> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +       swp_entry_t entry;
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++) {
>> +               entry = page_swap_entry(folio_page(folio, i));
>> +               set_bit(swp_offset(entry), sis->zeromap);
>> +       }
>> +}
>> +
>> +static void swap_zeromap_folio_clear(struct folio *folio)
>> +{
>> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +       swp_entry_t entry;
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++) {
>> +               entry = page_swap_entry(folio_page(folio, i));
>> +               clear_bit(swp_offset(entry), sis->zeromap);
>> +       }
>> +}
>> +
>> +static bool swap_zeromap_folio_test(struct folio *folio)
>> +{
>> +       struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +       swp_entry_t entry;
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < folio_nr_pages(folio); i++) {
>> +               entry = page_swap_entry(folio_page(folio, i));
>> +               if (!test_bit(swp_offset(entry), sis->zeromap))
>> +                       return false;
>> +       }
>> +       return true;
>> +}
>> +
>>   /*
>>    * We may have stale swap cache pages in memory: notice
>>    * them here and get rid of the unnecessary final write.
>> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>>                  folio_unlock(folio);
>>                  return ret;
>>          }
>> +
>> +       if (is_folio_zero_filled(folio)) {
>> +               swap_zeromap_folio_set(folio);
>> +               folio_start_writeback(folio);
>> +               folio_unlock(folio);
>> +               folio_end_writeback(folio);
>> +               return 0;
>> +       }
>> +       swap_zeromap_folio_clear(folio);
>>          if (zswap_store(folio)) {
>>                  folio_start_writeback(folio);
>>                  folio_unlock(folio);
>> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>>                  psi_memstall_enter(&pflags);
>>          }
>>          delayacct_swapin_start();
>> -
>> -       if (zswap_load(folio)) {
>> +       if (swap_zeromap_folio_test(folio)) {
>> +               folio_zero_fill(folio);
>> +               folio_mark_uptodate(folio);
>> +               folio_unlock(folio);
>> +       } else if (zswap_load(folio)) {
>>                  folio_mark_uptodate(folio);
>>                  folio_unlock(folio);
>>          } else if (data_race(sis->flags & SWP_FS_OPS)) {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index f1e559e216bd..90451174fe34 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>>   static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>                  unsigned int idx)
>>   {
>> +       unsigned int i;
>> +
>>          /*
>>           * If scan_swap_map_slots() can't find a free cluster, it will check
>>           * si->swap_map directly. To make sure the discarding cluster isn't
>> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>           */
>>          memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>                          SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> +       /*
>> +        * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
>> +        * call on other slots, hence use atomic clear_bit for zeromap instead of the
>> +        * non-atomic bitmap_clear.
>> +        */
>> +       for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> +               clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>>
>>          cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>>
>> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>>   static void swap_do_scheduled_discard(struct swap_info_struct *si)
>>   {
>>          struct swap_cluster_info *info, *ci;
>> -       unsigned int idx;
>> +       unsigned int idx, i;
>>
>>          info = si->cluster_info;
>>
>> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>>                  __free_cluster(si, idx);
>>                  memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>                                  0, SWAPFILE_CLUSTER);
>> +               for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> +                       clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>>                  unlock_cluster(ci);
>>          }
>>   }
>> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>>          count = p->swap_map[offset];
>>          VM_BUG_ON(count != SWAP_HAS_CACHE);
>>          p->swap_map[offset] = 0;
>> +       clear_bit(offset, p->zeromap);
> Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
> we can have a problem, where:
>
> 1. The swap entry has its zeromap bit set, and is freed to the swap
> slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
> reclaimed from the swap cache, and all the processes referring to it
> are terminated, which decrements the swap count to 0 (swap_free() ->
> __swap_entry_free() -> free_swap_slot())
>
> 2. The swap slot is then re-used in swap space allocation
> (add_to_swap()) - its zeromap bit is never cleared.
>
> 3. swap_writepage() writes that non-zero page to swap

In swap_writepage, with this patch you have:

     if (is_folio_zero_filled(folio)) {
         swap_zeromap_folio_set(folio);
         folio_unlock(folio);
         return 0;
     }
     swap_zeromap_folio_clear(folio);

i.e. if folio is not zero filled, swap_zeromap_folio_clear will be 
called and the bit is cleared, so I think it would take care of this 
scenario? swap_read_folio will see the bit cleared in step 4.

> 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
> the entry is set, so populates a zero page for it.
>
> zswap in the past had to invalidate these leftover entries quite
> carefully. Chengming then moved the invalidation point to
> free_swap_slot(), massively simplifying the logic.
>
> I wonder if we need to do the same here?
Nhat Pham June 11, 2024, 6:53 p.m. UTC | #17
On Tue, Jun 11, 2024 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > >         count = p->swap_map[offset];
> > >         VM_BUG_ON(count != SWAP_HAS_CACHE);
> > >         p->swap_map[offset] = 0;
> > > +       clear_bit(offset, p->zeromap);
> >
> > Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
> > we can have a problem, where:
> >
> > 1. The swap entry has its zeromap bit set, and is freed to the swap
> > slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
> > reclaimed from the swap cache, and all the processes referring to it
> > are terminated, which decrements the swap count to 0 (swap_free() ->
> > __swap_entry_free() -> free_swap_slot())
> >
> > 2. The swap slot is then re-used in swap space allocation
> > (add_to_swap()) - its zeromap bit is never cleared.
>
> I do not think this can happen before swap_entry_free() is called.
> Note that when a swap entry is freed to the swap slot cache in
> free_swap_slot(), it is added to cache->slots_ret, not cache->slots.
> The former are swap entries cached to be later freed using
> swap_entry_free().

Ahhh I see. Good point. Then yeah this should be safe from this POV.

>
> >
> > 3. swap_writepage() writes that non-zero page to swap
> >
> > 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
> > the entry is set, so populates a zero page for it.
> >
> > zswap in the past had to invalidate these leftover entries quite
> > carefully. Chengming then moved the invalidation point to
> > free_swap_slot(), massively simplifying the logic.
>
> I think the main benefit of moving the invalidation point was avoiding
> leaving the compressed page in zswap until swap_entry_free() is
> called, which will happen only when the swap slot caches are drained.
>

This is true. In this case yeah there's probably not much difference
between clearing the bit here vs in swap_entry_free().

> >
> > I wonder if we need to do the same here?
Nhat Pham June 11, 2024, 7:33 p.m. UTC | #18
On Tue, Jun 11, 2024 at 11:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
> In swap_writepage, with this patch you have:
>
>      if (is_folio_zero_filled(folio)) {
>          swap_zeromap_folio_set(folio);
>          folio_unlock(folio);
>          return 0;
>      }
>      swap_zeromap_folio_clear(folio);
>

I was concerned with the swap slot being freed and reused, without
ever being read :) But looks like it has to be properly reset before
being reused, so all is well on that front.

What about the put_swap_folio() -> swap_free_cluster() case - do we
need to handle zeromap bit clearing here too? Looks like it's clearing
the swap_map (i.e. returning it directly to the swapfile, allowing
those slots to be reused) here, and I notice that you clear the
zeromap bitmap wherever the swap_map is cleared as well :)

I jumped around the code a bit - in free_cluster() (called by
swap_free_cluster()), there's this chunk:

if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
    swap_cluster_schedule_discard(si, idx);
    return;
}

swap_cluster_schedule_discard() does clear_bit() on the zeromap on the
entire cluster. We also clear_bit() in the work function
swap_do_scheduled_discard() (is this redundant?).

But what if this check is false, i.e. the swap device does not have the
SWP_PAGE_DISCARD flag set? Are we not clearing the bits in the zeromap
here?
Usama Arif June 12, 2024, 10:42 a.m. UTC | #19
On 11/06/2024 20:33, Nhat Pham wrote:
> On Tue, Jun 11, 2024 at 11:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
>> In swap_writepage, with this patch you have:
>>
>>       if (is_folio_zero_filled(folio)) {
>>           swap_zeromap_folio_set(folio);
>>           folio_unlock(folio);
>>           return 0;
>>       }
>>       swap_zeromap_folio_clear(folio);
>>
> I was concerned with the swap slot being freed and reused, without
> ever being read :) But looks like it has to be properly reset before
> being reused, so all is well on that front.
>
> What about the put_swap_folio() -> swap_free_cluster() case - do we
> need to handle zeromap bit clearing here too? Looks like it's clearing
> the swap_map (i.e. returning it directly to the swapfile, allowing
> those slots to be reused) here, and I notice that you clear the
> zeromap bitmap wherever the swap_map is cleared as well :)
>
> I jumped around the code a bit - in free_cluster() (called by
> swap_free_cluster()), there's this chunk:
>
> if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
>      (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
>      swap_cluster_schedule_discard(si, idx);
>      return;
> }
>
> swap_cluster_schedule_discard() does clear_bit() on the zeromap on the
> entire cluster. We also clear_bit() in the work function
> swap_do_scheduled_discard() (is this redundant?).
>
> But what if this check is false, i.e. the swap device does not have the
> SWP_PAGE_DISCARD flag set? Are we not clearing the bits in the zeromap
> here?

Yes, the zeromap bits should be cleared in swap_free_cluster() as well;
will do in the next revision. Thanks!
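
Roughly something like the below (untested sketch; the surrounding
swap_free_cluster() context is from memory, so treat it as illustrative
rather than the exact diff):

    static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
    {
        unsigned long offset = idx * SWAPFILE_CLUSTER;
        unsigned int i;

        /* ... existing cluster locking and accounting ... */
        memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
        /*
         * Clear the zeromap bits for the whole cluster, as in the discard
         * paths, so re-used slots never see stale bits.
         */
        for (i = 0; i < SWAPFILE_CLUSTER; i++)
            clear_bit(offset + i, si->zeromap);
        /* ... */
    }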
diff mbox series

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a11c75e897ec..e88563978441 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -299,6 +299,7 @@  struct swap_info_struct {
 	signed char	type;		/* strange name for an index */
 	unsigned int	max;		/* extent of the swap_map */
 	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
+	unsigned long *zeromap;		/* vmalloc'ed bitmap to track zero pages */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
 	struct swap_cluster_list free_clusters; /* free clusters list */
 	unsigned int lowest_bit;	/* index of first free in swap_map */
diff --git a/mm/page_io.c b/mm/page_io.c
index a360857cf75d..2cac1e11fb85 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -172,6 +172,82 @@  int generic_swapfile_activate(struct swap_info_struct *sis,
 	goto out;
 }
 
+static bool is_folio_page_zero_filled(struct folio *folio, int i)
+{
+	unsigned long *data;
+	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
+	bool ret = false;
+
+	data = kmap_local_folio(folio, i * PAGE_SIZE);
+	if (data[last_pos])
+		goto out;
+	for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
+		if (data[pos])
+			goto out;
+	}
+	ret = true;
+out:
+	kunmap_local(data);
+	return ret;
+}
+
+static bool is_folio_zero_filled(struct folio *folio)
+{
+	unsigned int i;
+
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		if (!is_folio_page_zero_filled(folio, i))
+			return false;
+	}
+	return true;
+}
+
+static void folio_zero_fill(struct folio *folio)
+{
+	unsigned int i;
+
+	for (i = 0; i < folio_nr_pages(folio); i++)
+		clear_highpage(folio_page(folio, i));
+}
+
+static void swap_zeromap_folio_set(struct folio *folio)
+{
+	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	swp_entry_t entry;
+	unsigned int i;
+
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		entry = page_swap_entry(folio_page(folio, i));
+		set_bit(swp_offset(entry), sis->zeromap);
+	}
+}
+
+static void swap_zeromap_folio_clear(struct folio *folio)
+{
+	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	swp_entry_t entry;
+	unsigned int i;
+
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		entry = page_swap_entry(folio_page(folio, i));
+		clear_bit(swp_offset(entry), sis->zeromap);
+	}
+}
+
+static bool swap_zeromap_folio_test(struct folio *folio)
+{
+	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	swp_entry_t entry;
+	unsigned int i;
+
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		entry = page_swap_entry(folio_page(folio, i));
+		if (!test_bit(swp_offset(entry), sis->zeromap))
+			return false;
+	}
+	return true;
+}
+
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.
@@ -195,6 +271,15 @@  int swap_writepage(struct page *page, struct writeback_control *wbc)
 		folio_unlock(folio);
 		return ret;
 	}
+
+	if (is_folio_zero_filled(folio)) {
+		swap_zeromap_folio_set(folio);
+		folio_start_writeback(folio);
+		folio_unlock(folio);
+		folio_end_writeback(folio);
+		return 0;
+	}
+	swap_zeromap_folio_clear(folio);
 	if (zswap_store(folio)) {
 		folio_start_writeback(folio);
 		folio_unlock(folio);
@@ -515,8 +600,11 @@  void swap_read_folio(struct folio *folio, bool synchronous,
 		psi_memstall_enter(&pflags);
 	}
 	delayacct_swapin_start();
-
-	if (zswap_load(folio)) {
+	if (swap_zeromap_folio_test(folio)) {
+		folio_zero_fill(folio);
+		folio_mark_uptodate(folio);
+		folio_unlock(folio);
+	} else if (zswap_load(folio)) {
 		folio_mark_uptodate(folio);
 		folio_unlock(folio);
 	} else if (data_race(sis->flags & SWP_FS_OPS)) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f1e559e216bd..90451174fe34 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -453,6 +453,8 @@  static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
 static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 		unsigned int idx)
 {
+	unsigned int i;
+
 	/*
 	 * If scan_swap_map_slots() can't find a free cluster, it will check
 	 * si->swap_map directly. To make sure the discarding cluster isn't
@@ -461,6 +463,13 @@  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	 */
 	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
+	/*
+	 * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
+	 * call on other slots, hence use atomic clear_bit for zeromap instead of the
+	 * non-atomic bitmap_clear.
+	 */
+	for (i = 0; i < SWAPFILE_CLUSTER; i++)
+		clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
 
 	cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
 
@@ -482,7 +491,7 @@  static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
 static void swap_do_scheduled_discard(struct swap_info_struct *si)
 {
 	struct swap_cluster_info *info, *ci;
-	unsigned int idx;
+	unsigned int idx, i;
 
 	info = si->cluster_info;
 
@@ -498,6 +507,8 @@  static void swap_do_scheduled_discard(struct swap_info_struct *si)
 		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
+		for (i = 0; i < SWAPFILE_CLUSTER; i++)
+			clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
 		unlock_cluster(ci);
 	}
 }
@@ -1336,6 +1347,7 @@  static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	count = p->swap_map[offset];
 	VM_BUG_ON(count != SWAP_HAS_CACHE);
 	p->swap_map[offset] = 0;
+	clear_bit(offset, p->zeromap);
 	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
 
@@ -2597,6 +2609,7 @@  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	free_percpu(p->cluster_next_cpu);
 	p->cluster_next_cpu = NULL;
 	vfree(swap_map);
+	bitmap_free(p->zeromap);
 	kvfree(cluster_info);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
@@ -3123,6 +3136,12 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap_unlock_inode;
 	}
 
+	p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
+	if (!p->zeromap) {
+		error = -ENOMEM;
+		goto bad_swap_unlock_inode;
+	}
+
 	if (p->bdev && bdev_stable_writes(p->bdev))
 		p->flags |= SWP_STABLE_WRITES;