Message ID | 20240610121820.328876-2-usamaarif642@gmail.com (mailing list archive)
---|---
State | New
Series | mm: store zero pages to be swapped out in a bitmap
On Mon, Jun 10, 2024 at 01:15:59PM +0100, Usama Arif wrote:
> +	if (is_folio_zero_filled(folio)) {
> +		swap_zeromap_folio_set(folio);
> +		folio_start_writeback(folio);
> +		folio_unlock(folio);
> +		folio_end_writeback(folio);

What's the point?  As far as I can see, the only thing this is going to do is spend a lot of time messing with various counters only to end up with incrementing NR_WRITTEN, which is wrong because you didn't actually write it.
On 10/06/2024 14:07, Matthew Wilcox wrote: > On Mon, Jun 10, 2024 at 01:15:59PM +0100, Usama Arif wrote: >> + if (is_folio_zero_filled(folio)) { >> + swap_zeromap_folio_set(folio); >> + folio_start_writeback(folio); >> + folio_unlock(folio); >> + folio_end_writeback(folio); > What's the point? As far as I can see, the only thing this is going to > do is spend a lot of time messing with various counters only to end up > with incrementing NR_WRITTEN, which is wrong because you didn't actually > write it. > I am guessing what you are suggesting is just do this? if (is_folio_zero_filled(folio)) { swap_zeromap_folio_set(folio); folio_unlock(folio); return 0; } This is what I did initially while developing this, but when I started looking into why zswap_store does folio_start_writeback, folio_unlock, folio_end_writeback I found: https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L336 "If no I/O is submitted, the filesystem must run end_page_writeback() against the page before returning from writepage." and https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L349 "Note, failure to run either redirty_page_for_writepage() or the combination of set_page_writeback()/end_page_writeback() on a page submitted to writepage will leave the page itself marked clean but it will be tagged as dirty in the radix tree. This incoherency can lead to all sorts of hard-to-debug problems in the filesystem like having dirty inodes at umount and losing written data. " If we have zswap enabled, the zero filled pages (infact any page that is compressed), will be saved in zswap_entry and NR_WRITTEN will be wrongly incremented. So the behaviour for NR_WRITTEN does not change in this patch when encountering zero pages with zswap enabled (even if its wrong). This patch just extracts the optimization out from zswap [1] to swap, so that it always runs. In order to fix NR_WRITTEN accounting for zswap, this patch series and any other cases where no I/O is submitted but end_page_writeback is called before returning to writepage, maybe we could add an argument to __folio_end_writeback like below? There are a lot of calls to folio_end_writeback and the behaviour of zeropage with regards to NR_WRITTEN doesnt change with or without this patchseries with zswap enabled, so maybe we could keep this independent of this series? I would be happy to submit this as separate patch if it makes sense. diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 81b2e4128d26..415037f511c2 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -3042,7 +3042,7 @@ static void wb_inode_writeback_end(struct bdi_writeback *wb) spin_unlock_irqrestore(&wb->work_lock, flags); } -bool __folio_end_writeback(struct folio *folio) +bool __folio_end_writeback(struct folio *folio, bool nr_written_increment) { long nr = folio_nr_pages(folio); struct address_space *mapping = folio_mapping(folio); @@ -3078,7 +3078,8 @@ bool __folio_end_writeback(struct folio *folio) lruvec_stat_mod_folio(folio, NR_WRITEBACK, -nr); zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr); - node_stat_mod_folio(folio, NR_WRITTEN, nr); + if (nr_written_increment) + node_stat_mod_folio(folio, NR_WRITTEN, nr); folio_memcg_unlock(folio); return ret; [1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote:
> I am guessing what you are suggesting is just do this?
>
> if (is_folio_zero_filled(folio)) {
>         swap_zeromap_folio_set(folio);
>         folio_unlock(folio);
>         return 0;
> }

Right.

> This is what I did initially while developing this, but when I started
> looking into why zswap_store does folio_start_writeback, folio_unlock,
> folio_end_writeback I found:
>
> https://elixir.bootlin.com/linux/v6.9.3/source/Documentation/filesystems/locking.rst#L336
>
> "If no I/O is submitted, the filesystem must run end_page_writeback()
> against the page before returning from writepage."

But that's advice to filesystem authors.  File pages don't come this way; we only put anonymous pages into swap (except tmpfs).

> If we have zswap enabled, the zero filled pages (infact any page that is
> compressed), will be saved in zswap_entry and NR_WRITTEN will be wrongly
> incremented. So the behaviour for NR_WRITTEN does not change in this patch
> when encountering zero pages with zswap enabled (even if its wrong).

We should fix zswap too.

> In order to fix NR_WRITTEN accounting for zswap, this patch series and any
> other cases where no I/O is submitted but end_page_writeback is called
> before returning to writepage, maybe we could add an argument to
> __folio_end_writeback like below? There are a lot of calls to
> folio_end_writeback and the behaviour of zeropage with regards to NR_WRITTEN
> doesnt change with or without this patchseries with zswap enabled, so maybe
> we could keep this independent of this series? I would be happy to submit
> this as separate patch if it makes sense.

It makes no sense at all.
On 10/06/2024 15:06, Matthew Wilcox wrote:
> On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote:
>> I am guessing what you are suggesting is just do this?
>>
>> if (is_folio_zero_filled(folio)) {
>>         swap_zeromap_folio_set(folio);
>>         folio_unlock(folio);
>>         return 0;
>> }
> Right.

Thanks! Will change to this in the next revision.

>> If we have zswap enabled, the zero filled pages (infact any page that is
>> compressed), will be saved in zswap_entry and NR_WRITTEN will be wrongly
>> incremented. So the behaviour for NR_WRITTEN does not change in this patch
>> when encountering zero pages with zswap enabled (even if its wrong).
> We should fiz zswap too.
>

Will send the below diff as a separate patch for zswap:

diff --git a/mm/page_io.c b/mm/page_io.c
index 2cac1e11fb85..82796b9f08c7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -281,9 +281,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	}
 	swap_zeromap_folio_clear(folio);
 	if (zswap_store(folio)) {
-		folio_start_writeback(folio);
 		folio_unlock(folio);
-		folio_end_writeback(folio);
 		return 0;
 	}
On 10/06/2024 15:14, Usama Arif wrote: > > On 10/06/2024 15:06, Matthew Wilcox wrote: >> On Mon, Jun 10, 2024 at 02:56:09PM +0100, Usama Arif wrote: >>> I am guessing what you are suggesting is just do this? >>> >>> if (is_folio_zero_filled(folio)) { >>> swap_zeromap_folio_set(folio); >>> folio_unlock(folio); >>> return 0; >>> } >> Right. > > Thanks! Will change to this in the next revision. > >>> If we have zswap enabled, the zero filled pages (infact any page >>> that is >>> compressed), will be saved in zswap_entry and NR_WRITTEN will be >>> wrongly >>> incremented. So the behaviour for NR_WRITTEN does not change in this >>> patch >>> when encountering zero pages with zswap enabled (even if its wrong). >> We should fiz zswap too. >> > Will send the below diff as a separate patch for zswap: > > diff --git a/mm/page_io.c b/mm/page_io.c index > 2cac1e11fb85..82796b9f08c7 100644 --- a/mm/page_io.c +++ > b/mm/page_io.c @@ -281,9 +281,7 @@ int swap_writepage(struct page > *page, struct writeback_control *wbc) } > swap_zeromap_folio_clear(folio); if (zswap_store(folio)) { - > folio_start_writeback(folio); folio_unlock(folio); - > folio_end_writeback(folio); return 0; } > My mail client seems to have messed up the diff, but have sent the patch here (https://lore.kernel.org/all/20240610143037.812955-1-usamaarif642@gmail.com/). Tested with test_zswap kselftest.
On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote: > > Approximately 10-20% of pages to be swapped out are zero pages [1]. > Rather than reading/writing these pages to flash resulting > in increased I/O and flash wear, a bitmap can be used to mark these > pages as zero at write time, and the pages can be filled at > read time if the bit corresponding to the page is set. > With this patch, NVMe writes in Meta server fleet decreased > by almost 10% with conventional swap setup (zswap disabled). > > [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ > > Signed-off-by: Usama Arif <usamaarif642@gmail.com> > --- > include/linux/swap.h | 1 + > mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++- > mm/swapfile.c | 21 +++++++++- > 3 files changed, 111 insertions(+), 3 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index a11c75e897ec..e88563978441 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -299,6 +299,7 @@ struct swap_info_struct { > signed char type; /* strange name for an index */ > unsigned int max; /* extent of the swap_map */ > unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > struct swap_cluster_list free_clusters; /* free clusters list */ > unsigned int lowest_bit; /* index of first free in swap_map */ > diff --git a/mm/page_io.c b/mm/page_io.c > index a360857cf75d..2cac1e11fb85 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis, > goto out; > } > > +static bool is_folio_page_zero_filled(struct folio *folio, int i) > +{ > + unsigned long *data; > + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1; > + bool ret = false; > + > + data = kmap_local_folio(folio, i * PAGE_SIZE); > + if (data[last_pos]) > + goto out; > + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) { > + if (data[pos]) > + goto out; > + } > + ret = true; > +out: > + kunmap_local(data); > + return ret; > +} > + > +static bool is_folio_zero_filled(struct folio *folio) > +{ > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) { > + if (!is_folio_page_zero_filled(folio, i)) > + return false; > + } > + return true; > +} Is there any benefit to iterating on the folio in pages (i.e. have both is_folio_zero_filled() and is_folio_page_zero_filled())? Why don't we just kmap the entire folio and check it all at once? 
> + > +static void folio_zero_fill(struct folio *folio) > +{ > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) > + clear_highpage(folio_page(folio, i)); > +} > + > +static void swap_zeromap_folio_set(struct folio *folio) > +{ > + struct swap_info_struct *sis = swp_swap_info(folio->swap); > + swp_entry_t entry; > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) { > + entry = page_swap_entry(folio_page(folio, i)); > + set_bit(swp_offset(entry), sis->zeromap); > + } > +} > + > +static void swap_zeromap_folio_clear(struct folio *folio) > +{ > + struct swap_info_struct *sis = swp_swap_info(folio->swap); > + swp_entry_t entry; > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) { > + entry = page_swap_entry(folio_page(folio, i)); > + clear_bit(swp_offset(entry), sis->zeromap); > + } > +} > + > +static bool swap_zeromap_folio_test(struct folio *folio) > +{ > + struct swap_info_struct *sis = swp_swap_info(folio->swap); > + swp_entry_t entry; > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) { > + entry = page_swap_entry(folio_page(folio, i)); > + if (!test_bit(swp_offset(entry), sis->zeromap)) > + return false; > + } > + return true; > +} > + > /* > * We may have stale swap cache pages in memory: notice > * them here and get rid of the unnecessary final write. > @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) > folio_unlock(folio); > return ret; > } > + > + if (is_folio_zero_filled(folio)) { > + swap_zeromap_folio_set(folio); > + folio_start_writeback(folio); > + folio_unlock(folio); > + folio_end_writeback(folio); > + return 0; > + } > + swap_zeromap_folio_clear(folio); > if (zswap_store(folio)) { > folio_start_writeback(folio); > folio_unlock(folio); > @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous, > psi_memstall_enter(&pflags); > } > delayacct_swapin_start(); > - > - if (zswap_load(folio)) { > + if (swap_zeromap_folio_test(folio)) { > + folio_zero_fill(folio); > + folio_mark_uptodate(folio); > + folio_unlock(folio); We don't currently support swapping in large folios, but it is a work in progress, and this will break once we have it. swap_zeromap_folio_test() will return false even if parts of the folio are in fact zero-filled. Then, we will go read those from disk swap, essentially corrupting data. The same problem can happen for zswap, if a large folio being swapped is only partially in zswap. In both cases, it's really easy to miss the problem if we're testing with zswap disabled, with incompressible data, or with non-zero data. Silent data corruption is not very debuggable. I proposed adding handling for this case in zswap here: https://lore.kernel.org/lkml/20240608023654.3513385-1-yosryahmed@google.com/ The discussions there hadn't settled yet, but depending on how it pans out I suspect we will want something similar for the zeromap case as well. > + } else if (zswap_load(folio)) { > folio_mark_uptodate(folio); > folio_unlock(folio); > } else if (data_race(sis->flags & SWP_FS_OPS)) { > diff --git a/mm/swapfile.c b/mm/swapfile.c > index f1e559e216bd..90451174fe34 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list, > static void swap_cluster_schedule_discard(struct swap_info_struct *si, > unsigned int idx) > { > + unsigned int i; > + > /* > * If scan_swap_map_slots() can't find a free cluster, it will check > * si->swap_map directly. 
To make sure the discarding cluster isn't > @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > */ > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > SWAP_MAP_BAD, SWAPFILE_CLUSTER); > + /* > + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio() > + * call on other slots, hence use atomic clear_bit for zeromap instead of the > + * non-atomic bitmap_clear. > + */ > + for (i = 0; i < SWAPFILE_CLUSTER; i++) > + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); > > cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); > > @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > static void swap_do_scheduled_discard(struct swap_info_struct *si) > { > struct swap_cluster_info *info, *ci; > - unsigned int idx; > + unsigned int idx, i; > > info = si->cluster_info; > > @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si) > __free_cluster(si, idx); > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > 0, SWAPFILE_CLUSTER); > + for (i = 0; i < SWAPFILE_CLUSTER; i++) > + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); > unlock_cluster(ci); > } > } > @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) > count = p->swap_map[offset]; > VM_BUG_ON(count != SWAP_HAS_CACHE); > p->swap_map[offset] = 0; > + clear_bit(offset, p->zeromap); > dec_cluster_info_page(p, p->cluster_info, offset); > unlock_cluster(ci); > > @@ -2597,6 +2609,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > free_percpu(p->cluster_next_cpu); > p->cluster_next_cpu = NULL; > vfree(swap_map); > + bitmap_free(p->zeromap); > kvfree(cluster_info); > /* Destroy swap account information */ > swap_cgroup_swapoff(p->type); > @@ -3123,6 +3136,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) > goto bad_swap_unlock_inode; > } > > + p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL); > + if (!p->zeromap) { > + error = -ENOMEM; > + goto bad_swap_unlock_inode; > + } > + > if (p->bdev && bdev_stable_writes(p->bdev)) > p->flags |= SWP_STABLE_WRITES; > > -- > 2.43.0 >
On 10/06/2024 18:57, Yosry Ahmed wrote: > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote: >> Approximately 10-20% of pages to be swapped out are zero pages [1]. >> Rather than reading/writing these pages to flash resulting >> in increased I/O and flash wear, a bitmap can be used to mark these >> pages as zero at write time, and the pages can be filled at >> read time if the bit corresponding to the page is set. >> With this patch, NVMe writes in Meta server fleet decreased >> by almost 10% with conventional swap setup (zswap disabled). >> >> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ >> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com> >> --- >> include/linux/swap.h | 1 + >> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++- >> mm/swapfile.c | 21 +++++++++- >> 3 files changed, 111 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/swap.h b/include/linux/swap.h >> index a11c75e897ec..e88563978441 100644 >> --- a/include/linux/swap.h >> +++ b/include/linux/swap.h >> @@ -299,6 +299,7 @@ struct swap_info_struct { >> signed char type; /* strange name for an index */ >> unsigned int max; /* extent of the swap_map */ >> unsigned char *swap_map; /* vmalloc'ed array of usage counts */ >> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ >> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ >> struct swap_cluster_list free_clusters; /* free clusters list */ >> unsigned int lowest_bit; /* index of first free in swap_map */ >> diff --git a/mm/page_io.c b/mm/page_io.c >> index a360857cf75d..2cac1e11fb85 100644 >> --- a/mm/page_io.c >> +++ b/mm/page_io.c >> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis, >> goto out; >> } >> >> +static bool is_folio_page_zero_filled(struct folio *folio, int i) >> +{ >> + unsigned long *data; >> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1; >> + bool ret = false; >> + >> + data = kmap_local_folio(folio, i * PAGE_SIZE); >> + if (data[last_pos]) >> + goto out; >> + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) { >> + if (data[pos]) >> + goto out; >> + } >> + ret = true; >> +out: >> + kunmap_local(data); >> + return ret; >> +} >> + >> +static bool is_folio_zero_filled(struct folio *folio) >> +{ >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) { >> + if (!is_folio_page_zero_filled(folio, i)) >> + return false; >> + } >> + return true; >> +} > Is there any benefit to iterating on the folio in pages (i.e. have > both is_folio_zero_filled() and is_folio_page_zero_filled())? Why > don't we just kmap the entire folio and check it all at once? Is there an API to kmap an entire folio? I could just do data = page_address(&folio->page) in above and iterate through folio_nr_pages(folio) * PAGE_SIZE, and do it all in one function, but this just looks much nicer and much more readable? In other places in the kernel, the folio iteration is through pages as well: https://elixir.bootlin.com/linux/latest/source/arch/csky/abiv2/cacheflush.c#L29 https://elixir.bootlin.com/linux/latest/source/arch/mips/mm/cache.c#L162 and others as well. 
>> + >> +static void folio_zero_fill(struct folio *folio) >> +{ >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) >> + clear_highpage(folio_page(folio, i)); >> +} >> + >> +static void swap_zeromap_folio_set(struct folio *folio) >> +{ >> + struct swap_info_struct *sis = swp_swap_info(folio->swap); >> + swp_entry_t entry; >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) { >> + entry = page_swap_entry(folio_page(folio, i)); >> + set_bit(swp_offset(entry), sis->zeromap); >> + } >> +} >> + >> +static void swap_zeromap_folio_clear(struct folio *folio) >> +{ >> + struct swap_info_struct *sis = swp_swap_info(folio->swap); >> + swp_entry_t entry; >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) { >> + entry = page_swap_entry(folio_page(folio, i)); >> + clear_bit(swp_offset(entry), sis->zeromap); >> + } >> +} >> + >> +static bool swap_zeromap_folio_test(struct folio *folio) >> +{ >> + struct swap_info_struct *sis = swp_swap_info(folio->swap); >> + swp_entry_t entry; >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) { >> + entry = page_swap_entry(folio_page(folio, i)); >> + if (!test_bit(swp_offset(entry), sis->zeromap)) >> + return false; >> + } >> + return true; >> +} >> + >> /* >> * We may have stale swap cache pages in memory: notice >> * them here and get rid of the unnecessary final write. >> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) >> folio_unlock(folio); >> return ret; >> } >> + >> + if (is_folio_zero_filled(folio)) { >> + swap_zeromap_folio_set(folio); >> + folio_start_writeback(folio); >> + folio_unlock(folio); >> + folio_end_writeback(folio); >> + return 0; >> + } >> + swap_zeromap_folio_clear(folio); >> if (zswap_store(folio)) { >> folio_start_writeback(folio); >> folio_unlock(folio); >> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous, >> psi_memstall_enter(&pflags); >> } >> delayacct_swapin_start(); >> - >> - if (zswap_load(folio)) { >> + if (swap_zeromap_folio_test(folio)) { >> + folio_zero_fill(folio); >> + folio_mark_uptodate(folio); >> + folio_unlock(folio); > We don't currently support swapping in large folios, but it is a work > in progress, and this will break once we have it. > swap_zeromap_folio_test() will return false even if parts of the folio > are in fact zero-filled. Then, we will go read those from disk swap, > essentially corrupting data. So yes, with this patch I tested swap out of large zero folio, but when swapping in it was page by page. My assumption was that swap_read_folio (when support is added) would only pass a large folio that was earlier swapped out as a large folio. So if a zero filled large folio was swapped out, the zeromap will be set for all the pages in that folio, and at large folio swap in (when it is supported), it will see that all the bits in the zeromap for that folio are set, and will just folio_zero_fill. If even a single page in large folio has non-zero data then zeromap will not store it and it will go to either zswap or disk, and at read time as all the bits in zeromap are not set, it will still goto either zswap or disk, so I think this works? Is my assumption wrong that only large folios can be swapped in only if they were swapped out as large? I think this code works in that case. > > The same problem can happen for zswap, if a large folio being swapped > is only partially in zswap. 
In both cases, it's really easy to miss > the problem if we're testing with zswap disabled, with incompressible > data, or with non-zero data. Silent data corruption is not very > debuggable. > > I proposed adding handling for this case in zswap here: > https://lore.kernel.org/lkml/20240608023654.3513385-1-yosryahmed@google.com/ > > The discussions there hadn't settled yet, but depending on how it pans > out I suspect we will want something similar for the zeromap case as > well. > >> + } else if (zswap_load(folio)) { >> folio_mark_uptodate(folio); >> folio_unlock(folio); >> } else if (data_race(sis->flags & SWP_FS_OPS)) { >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index f1e559e216bd..90451174fe34 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list, >> static void swap_cluster_schedule_discard(struct swap_info_struct *si, >> unsigned int idx) >> { >> + unsigned int i; >> + >> /* >> * If scan_swap_map_slots() can't find a free cluster, it will check >> * si->swap_map directly. To make sure the discarding cluster isn't >> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, >> */ >> memset(si->swap_map + idx * SWAPFILE_CLUSTER, >> SWAP_MAP_BAD, SWAPFILE_CLUSTER); >> + /* >> + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio() >> + * call on other slots, hence use atomic clear_bit for zeromap instead of the >> + * non-atomic bitmap_clear. >> + */ >> + for (i = 0; i < SWAPFILE_CLUSTER; i++) >> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); >> >> cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); >> >> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) >> static void swap_do_scheduled_discard(struct swap_info_struct *si) >> { >> struct swap_cluster_info *info, *ci; >> - unsigned int idx; >> + unsigned int idx, i; >> >> info = si->cluster_info; >> >> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si) >> __free_cluster(si, idx); >> memset(si->swap_map + idx * SWAPFILE_CLUSTER, >> 0, SWAPFILE_CLUSTER); >> + for (i = 0; i < SWAPFILE_CLUSTER; i++) >> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); >> unlock_cluster(ci); >> } >> } >> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) >> count = p->swap_map[offset]; >> VM_BUG_ON(count != SWAP_HAS_CACHE); >> p->swap_map[offset] = 0; >> + clear_bit(offset, p->zeromap); >> dec_cluster_info_page(p, p->cluster_info, offset); >> unlock_cluster(ci); >> >> @@ -2597,6 +2609,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) >> free_percpu(p->cluster_next_cpu); >> p->cluster_next_cpu = NULL; >> vfree(swap_map); >> + bitmap_free(p->zeromap); >> kvfree(cluster_info); >> /* Destroy swap account information */ >> swap_cgroup_swapoff(p->type); >> @@ -3123,6 +3136,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) >> goto bad_swap_unlock_inode; >> } >> >> + p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL); >> + if (!p->zeromap) { >> + error = -ENOMEM; >> + goto bad_swap_unlock_inode; >> + } >> + >> if (p->bdev && bdev_stable_writes(p->bdev)) >> p->flags |= SWP_STABLE_WRITES; >> >> -- >> 2.43.0 >>
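For reference, a single-pass variant along the lines of the page_address() idea above could look roughly like this (a minimal sketch, not part of the series; the helper name is made up, and it assumes !CONFIG_HIGHMEM so the whole folio is addressable through the direct map; the per-page kmap_local_folio() loop in the patch does not need that assumption):

static bool is_folio_zero_filled_onepass(struct folio *folio)
{
	/* Only valid when the folio is in the direct map (!CONFIG_HIGHMEM). */
	void *data = folio_address(folio);

	/* memchr_inv() returns NULL iff every byte in the range is zero. */
	return !memchr_inv(data, 0, folio_nr_pages(folio) * PAGE_SIZE);
}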
On Mon, Jun 10, 2024 at 11:36 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > On 10/06/2024 18:57, Yosry Ahmed wrote: > > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote: > >> Approximately 10-20% of pages to be swapped out are zero pages [1]. > >> Rather than reading/writing these pages to flash resulting > >> in increased I/O and flash wear, a bitmap can be used to mark these > >> pages as zero at write time, and the pages can be filled at > >> read time if the bit corresponding to the page is set. > >> With this patch, NVMe writes in Meta server fleet decreased > >> by almost 10% with conventional swap setup (zswap disabled). > >> > >> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ > >> > >> Signed-off-by: Usama Arif <usamaarif642@gmail.com> > >> --- > >> include/linux/swap.h | 1 + > >> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++- > >> mm/swapfile.c | 21 +++++++++- > >> 3 files changed, 111 insertions(+), 3 deletions(-) > >> > >> diff --git a/include/linux/swap.h b/include/linux/swap.h > >> index a11c75e897ec..e88563978441 100644 > >> --- a/include/linux/swap.h > >> +++ b/include/linux/swap.h > >> @@ -299,6 +299,7 @@ struct swap_info_struct { > >> signed char type; /* strange name for an index */ > >> unsigned int max; /* extent of the swap_map */ > >> unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > >> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ > >> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > >> struct swap_cluster_list free_clusters; /* free clusters list */ > >> unsigned int lowest_bit; /* index of first free in swap_map */ > >> diff --git a/mm/page_io.c b/mm/page_io.c > >> index a360857cf75d..2cac1e11fb85 100644 > >> --- a/mm/page_io.c > >> +++ b/mm/page_io.c > >> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis, > >> goto out; > >> } > >> > >> +static bool is_folio_page_zero_filled(struct folio *folio, int i) > >> +{ > >> + unsigned long *data; > >> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1; > >> + bool ret = false; > >> + > >> + data = kmap_local_folio(folio, i * PAGE_SIZE); > >> + if (data[last_pos]) > >> + goto out; > >> + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) { > >> + if (data[pos]) > >> + goto out; > >> + } > >> + ret = true; > >> +out: > >> + kunmap_local(data); > >> + return ret; > >> +} > >> + > >> +static bool is_folio_zero_filled(struct folio *folio) > >> +{ > >> + unsigned int i; > >> + > >> + for (i = 0; i < folio_nr_pages(folio); i++) { > >> + if (!is_folio_page_zero_filled(folio, i)) > >> + return false; > >> + } > >> + return true; > >> +} > > Is there any benefit to iterating on the folio in pages (i.e. have > > both is_folio_zero_filled() and is_folio_page_zero_filled())? Why > > don't we just kmap the entire folio and check it all at once? > > Is there an API to kmap an entire folio? I thought kmap_local_folio() takes in a 'size' parameter for some reason, my bad. > > I could just do data = page_address(&folio->page) in above and iterate > through folio_nr_pages(folio) * PAGE_SIZE, and do it all in one > function, but this just looks much nicer and much more readable? Yeah if we need to map each page individually, the current code is better. 
> > In other places in the kernel, the folio iteration is through pages as well: > > https://elixir.bootlin.com/linux/latest/source/arch/csky/abiv2/cacheflush.c#L29 > > https://elixir.bootlin.com/linux/latest/source/arch/mips/mm/cache.c#L162 > > and others as well. [..] > >> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous, > >> psi_memstall_enter(&pflags); > >> } > >> delayacct_swapin_start(); > >> - > >> - if (zswap_load(folio)) { > >> + if (swap_zeromap_folio_test(folio)) { > >> + folio_zero_fill(folio); > >> + folio_mark_uptodate(folio); > >> + folio_unlock(folio); > > We don't currently support swapping in large folios, but it is a work > > in progress, and this will break once we have it. > > swap_zeromap_folio_test() will return false even if parts of the folio > > are in fact zero-filled. Then, we will go read those from disk swap, > > essentially corrupting data. > > So yes, with this patch I tested swap out of large zero folio, but when > swapping in it was page by page. My assumption was that swap_read_folio > (when support is added) would only pass a large folio that was earlier > swapped out as a large folio. So if a zero filled large folio was > swapped out, the zeromap will be set for all the pages in that folio, > and at large folio swap in (when it is supported), it will see that all > the bits in the zeromap for that folio are set, and will just > folio_zero_fill. > > If even a single page in large folio has non-zero data then zeromap will > not store it and it will go to either zswap or disk, and at read time as > all the bits in zeromap are not set, it will still goto either zswap or > disk, so I think this works? > > Is my assumption wrong that only large folios can be swapped in only if > they were swapped out as large? I think this code works in that case. I think the assumption is incorrect. I think we would just check if contiguous PTEs have contiguous swap entries and swapin the folio as a large folio in this case. It is likely that the swap entries are contiguous because it was swapped out as a large folio, but I don't think it's guaranteed. For example, here is a patch that implements large swapin support for the synchronous swapin case, and I think it just checks that the PTEs have contiguous swap entries: https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/ This makes a lot of sense because otherwise you'd have to keep track of how the folios were composed at the time of swapout, to be able to swap the same folios back in.
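As a rough illustration of the kind of check being described here (an assumption sketched for discussion, not code taken from Barry's patch; the helper name is invented and the caller is assumed to hold the page table lock over the nr PTEs):

static bool swap_entries_are_contiguous(pte_t *ptep, int nr)
{
	pte_t pte = ptep_get(ptep);
	swp_entry_t first, entry;
	int i;

	if (!is_swap_pte(pte))
		return false;
	first = pte_to_swp_entry(pte);
	if (non_swap_entry(first))
		return false;

	for (i = 1; i < nr; i++) {
		pte = ptep_get(ptep + i);
		if (!is_swap_pte(pte))
			return false;
		entry = pte_to_swp_entry(pte);
		/* Same swap device, offsets increasing by exactly one. */
		if (swp_type(entry) != swp_type(first) ||
		    swp_offset(entry) != swp_offset(first) + i)
			return false;
	}
	return true;
}

If such a check passes, the entries may well come from a folio that was swapped out as a large folio, but as noted above that is not guaranteed.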
@@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous, >>>> psi_memstall_enter(&pflags); >>>> } >>>> delayacct_swapin_start(); >>>> - >>>> - if (zswap_load(folio)) { >>>> + if (swap_zeromap_folio_test(folio)) { >>>> + folio_zero_fill(folio); >>>> + folio_mark_uptodate(folio); >>>> + folio_unlock(folio); >>> We don't currently support swapping in large folios, but it is a work >>> in progress, and this will break once we have it. >>> swap_zeromap_folio_test() will return false even if parts of the folio >>> are in fact zero-filled. Then, we will go read those from disk swap, >>> essentially corrupting data. >> So yes, with this patch I tested swap out of large zero folio, but when >> swapping in it was page by page. My assumption was that swap_read_folio >> (when support is added) would only pass a large folio that was earlier >> swapped out as a large folio. So if a zero filled large folio was >> swapped out, the zeromap will be set for all the pages in that folio, >> and at large folio swap in (when it is supported), it will see that all >> the bits in the zeromap for that folio are set, and will just >> folio_zero_fill. >> >> If even a single page in large folio has non-zero data then zeromap will >> not store it and it will go to either zswap or disk, and at read time as >> all the bits in zeromap are not set, it will still goto either zswap or >> disk, so I think this works? >> >> Is my assumption wrong that only large folios can be swapped in only if >> they were swapped out as large? I think this code works in that case. > I think the assumption is incorrect. I think we would just check if > contiguous PTEs have contiguous swap entries and swapin the folio as a > large folio in this case. It is likely that the swap entries are > contiguous because it was swapped out as a large folio, but I don't > think it's guaranteed. Yes, makes sense. Thanks for explaining this. > > For example, here is a patch that implements large swapin support for > the synchronous swapin case, and I think it just checks that the PTEs > have contiguous swap entries: > https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/ > > This makes a lot of sense because otherwise you'd have to keep track > of how the folios were composed at the time of swapout, to be able to > swap the same folios back in. I think the solution to large folio swap-in for this optimization and zswap is not in swap_read_folio in this patch-series or any call further down the stack, as its too late by the time you reach here, but in Barrys' patch. This needs to happen much earlier when deciding the size of the folio at folio creation in alloc_swap_folio, after Barry checks if (is_zswap_enabled()) goto fallback; once the order is decided, we would need to check the indexes in the zeromap array starting from swap_offset(entry) and ending at swap_offset(entry) + 2^order are set. If no bits are set, or all bits are set, then we allocate large folio. Otherwise, we goto fallback. I think its better to handle this in Barrys patch. I feel this series is close to its final state, i.e. the only diff I have for the next revision is below to remove start/end_writeback for zer_filled case. I will comment on Barrys patch once the I send out the next revision of this. 
diff --git a/mm/page_io.c b/mm/page_io.c
index 2cac1e11fb85..08a3772ddcf7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -274,9 +274,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 
 	if (is_folio_zero_filled(folio)) {
 		swap_zeromap_folio_set(folio);
-		folio_start_writeback(folio);
 		folio_unlock(folio);
-		folio_end_writeback(folio);
 		return 0;
 	}
 	swap_zeromap_folio_clear(folio);
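For the alloc_swap_folio() check described above, a hypothetical helper (name invented here, not part of this series or of Barry's patch) could look roughly like this: allow the 2^order swap-in only if the zeromap bits covering the range all agree.

static bool swap_zeromap_range_uniform(swp_entry_t entry, int order)
{
	struct swap_info_struct *sis = swp_swap_info(entry);
	unsigned long start = swp_offset(entry);
	unsigned long end = start + (1UL << order);

	/* No bit set in [start, end): none of the entries are zero-filled. */
	if (find_next_bit(sis->zeromap, end, start) == end)
		return true;

	/* No bit clear in [start, end): all of the entries are zero-filled. */
	if (find_next_zero_bit(sis->zeromap, end, start) == end)
		return true;

	/* Mixed zeromap state: fall back to order-0 swap-in. */
	return false;
}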
On Tue, Jun 11, 2024 at 4:49 AM Usama Arif <usamaarif642@gmail.com> wrote: > > @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool > synchronous, > >>>> psi_memstall_enter(&pflags); > >>>> } > >>>> delayacct_swapin_start(); > >>>> - > >>>> - if (zswap_load(folio)) { > >>>> + if (swap_zeromap_folio_test(folio)) { > >>>> + folio_zero_fill(folio); > >>>> + folio_mark_uptodate(folio); > >>>> + folio_unlock(folio); > >>> We don't currently support swapping in large folios, but it is a work > >>> in progress, and this will break once we have it. > >>> swap_zeromap_folio_test() will return false even if parts of the folio > >>> are in fact zero-filled. Then, we will go read those from disk swap, > >>> essentially corrupting data. > >> So yes, with this patch I tested swap out of large zero folio, but when > >> swapping in it was page by page. My assumption was that swap_read_folio > >> (when support is added) would only pass a large folio that was earlier > >> swapped out as a large folio. So if a zero filled large folio was > >> swapped out, the zeromap will be set for all the pages in that folio, > >> and at large folio swap in (when it is supported), it will see that all > >> the bits in the zeromap for that folio are set, and will just > >> folio_zero_fill. > >> > >> If even a single page in large folio has non-zero data then zeromap will > >> not store it and it will go to either zswap or disk, and at read time as > >> all the bits in zeromap are not set, it will still goto either zswap or > >> disk, so I think this works? > >> > >> Is my assumption wrong that only large folios can be swapped in only if > >> they were swapped out as large? I think this code works in that case. > > I think the assumption is incorrect. I think we would just check if > > contiguous PTEs have contiguous swap entries and swapin the folio as a > > large folio in this case. It is likely that the swap entries are > > contiguous because it was swapped out as a large folio, but I don't > > think it's guaranteed. > > Yes, makes sense. Thanks for explaining this. > > > > > For example, here is a patch that implements large swapin support for > > the synchronous swapin case, and I think it just checks that the PTEs > > have contiguous swap entries: > > https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/ > > > > This makes a lot of sense because otherwise you'd have to keep track > > of how the folios were composed at the time of swapout, to be able to > > swap the same folios back in. > > I think the solution to large folio swap-in for this optimization and > zswap is not in swap_read_folio in this patch-series or any call further > down the stack, as its too late by the time you reach here, but in > Barrys' patch. This needs to happen much earlier when deciding the size > of the folio at folio creation in alloc_swap_folio, after Barry checks > > if (is_zswap_enabled()) > goto fallback; > > once the order is decided, we would need to check the indexes in the > zeromap array starting from swap_offset(entry) and ending at > swap_offset(entry) + 2^order are set. If no bits are set, or all bits > are set, then we allocate large folio. Otherwise, we goto fallback. > > I think its better to handle this in Barrys patch. I feel this series is > close to its final state, i.e. the only diff I have for the next > revision is below to remove start/end_writeback for zer_filled case. I > will comment on Barrys patch once the I send out the next revision of this. 
Sorry I did not make myself clearer. I did not mean that you should handle the large folio swapin here. This needs to be handled at a higher level because as you mentioned, a large folio may be partially in the zeromap, zswap, swapcache, disk, etc. What I meant is that we should probably have a debug check to make sure this doesn't go unhandled. For zswap, I am trying to add a warning and fail the swapin operation if a large folio slips through to zswap. We can do something similar here if folks agree this is the right way in the interim: https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/. Maybe I am too paranoid, but I think it's easy to mess up these things when working on large folio swapin imo.
On 11/06/2024 16:42, Yosry Ahmed wrote: > On Tue, Jun 11, 2024 at 4:49 AM Usama Arif <usamaarif642@gmail.com> wrote: >> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool >> synchronous, >>>>>> psi_memstall_enter(&pflags); >>>>>> } >>>>>> delayacct_swapin_start(); >>>>>> - >>>>>> - if (zswap_load(folio)) { >>>>>> + if (swap_zeromap_folio_test(folio)) { >>>>>> + folio_zero_fill(folio); >>>>>> + folio_mark_uptodate(folio); >>>>>> + folio_unlock(folio); >>>>> We don't currently support swapping in large folios, but it is a work >>>>> in progress, and this will break once we have it. >>>>> swap_zeromap_folio_test() will return false even if parts of the folio >>>>> are in fact zero-filled. Then, we will go read those from disk swap, >>>>> essentially corrupting data. >>>> So yes, with this patch I tested swap out of large zero folio, but when >>>> swapping in it was page by page. My assumption was that swap_read_folio >>>> (when support is added) would only pass a large folio that was earlier >>>> swapped out as a large folio. So if a zero filled large folio was >>>> swapped out, the zeromap will be set for all the pages in that folio, >>>> and at large folio swap in (when it is supported), it will see that all >>>> the bits in the zeromap for that folio are set, and will just >>>> folio_zero_fill. >>>> >>>> If even a single page in large folio has non-zero data then zeromap will >>>> not store it and it will go to either zswap or disk, and at read time as >>>> all the bits in zeromap are not set, it will still goto either zswap or >>>> disk, so I think this works? >>>> >>>> Is my assumption wrong that only large folios can be swapped in only if >>>> they were swapped out as large? I think this code works in that case. >>> I think the assumption is incorrect. I think we would just check if >>> contiguous PTEs have contiguous swap entries and swapin the folio as a >>> large folio in this case. It is likely that the swap entries are >>> contiguous because it was swapped out as a large folio, but I don't >>> think it's guaranteed. >> Yes, makes sense. Thanks for explaining this. >> >>> For example, here is a patch that implements large swapin support for >>> the synchronous swapin case, and I think it just checks that the PTEs >>> have contiguous swap entries: >>> https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/ >>> >>> This makes a lot of sense because otherwise you'd have to keep track >>> of how the folios were composed at the time of swapout, to be able to >>> swap the same folios back in. >> I think the solution to large folio swap-in for this optimization and >> zswap is not in swap_read_folio in this patch-series or any call further >> down the stack, as its too late by the time you reach here, but in >> Barrys' patch. This needs to happen much earlier when deciding the size >> of the folio at folio creation in alloc_swap_folio, after Barry checks >> >> if (is_zswap_enabled()) >> goto fallback; >> >> once the order is decided, we would need to check the indexes in the >> zeromap array starting from swap_offset(entry) and ending at >> swap_offset(entry) + 2^order are set. If no bits are set, or all bits >> are set, then we allocate large folio. Otherwise, we goto fallback. >> >> I think its better to handle this in Barrys patch. I feel this series is >> close to its final state, i.e. the only diff I have for the next >> revision is below to remove start/end_writeback for zer_filled case. 
I >> will comment on Barrys patch once the I send out the next revision of this. > Sorry I did not make myself clearer. I did not mean that you should > handle the large folio swapin here. This needs to be handled at a > higher level because as you mentioned, a large folio may be partially > in the zeromap, zswap, swapcache, disk, etc. > > What I meant is that we should probably have a debug check to make > sure this doesn't go unhandled. For zswap, I am trying to add a > warning and fail the swapin operation if a large folio slips through > to zswap. We can do something similar here if folks agree this is the > right way in the interim: > https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/. > > Maybe I am too paranoid, but I think it's easy to mess up these things > when working on large folio swapin imo. So there is a difference between zswap and this optimization. In this optimization, if the zeromap is set for all the folio bits, then we should do large folio swapin. There still needs to be a change in Barrys patch in alloc_swap_folio, but apart from that does the below diff over v3 make it better? I will send a v4 with this if it sounds good. diff --git a/mm/page_io.c b/mm/page_io.c index 6400be6e4291..bf01364748a9 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio *folio) } } -static bool swap_zeromap_folio_test(struct folio *folio) +/* + * Return the index of the first subpage which is not zero-filled + * according to swap_info_struct->zeromap. + * If all pages are zero-filled according to zeromap, it will return + * folio_nr_pages(folio). + */ +static long swap_zeromap_folio_test(struct folio *folio) { struct swap_info_struct *sis = swp_swap_info(folio->swap); swp_entry_t entry; - unsigned int i; + long i; for (i = 0; i < folio_nr_pages(folio); i++) { entry = page_swap_entry(folio_page(folio, i)); if (!test_bit(swp_offset(entry), sis->zeromap)) - return false; + return i; } - return true; + return i; } /* @@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool synchronous, { struct swap_info_struct *sis = swp_swap_info(folio->swap); bool workingset = folio_test_workingset(folio); + long first_non_zero_page_idx; unsigned long pflags; bool in_thrashing; @@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool synchronous, psi_memstall_enter(&pflags); } delayacct_swapin_start(); - if (swap_zeromap_folio_test(folio)) { + first_non_zero_page_idx = swap_zeromap_folio_test(folio); + if (first_non_zero_page_idx == folio_nr_pages(folio)) { folio_zero_fill(folio); folio_mark_uptodate(folio); folio_unlock(folio); + } else if (first_non_zero_page_idx != 0) { + /* + * The case for when only *some* of subpages being swapped-in were recorded + * in sis->zeromap, while the rest are in zswap/disk is currently not handled. + * WARN in this case and return without marking the folio uptodate so that + * an IO error is emitted (e.g. do_swap_page() will sigbus). + */ + WARN_ON_ONCE(1); } else if (zswap_load(folio)) { folio_mark_uptodate(folio); folio_unlock(folio);
[..] > >> I think its better to handle this in Barrys patch. I feel this series is > >> close to its final state, i.e. the only diff I have for the next > >> revision is below to remove start/end_writeback for zer_filled case. I > >> will comment on Barrys patch once the I send out the next revision of this. > > Sorry I did not make myself clearer. I did not mean that you should > > handle the large folio swapin here. This needs to be handled at a > > higher level because as you mentioned, a large folio may be partially > > in the zeromap, zswap, swapcache, disk, etc. > > > > What I meant is that we should probably have a debug check to make > > sure this doesn't go unhandled. For zswap, I am trying to add a > > warning and fail the swapin operation if a large folio slips through > > to zswap. We can do something similar here if folks agree this is the > > right way in the interim: > > https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/. > > > > Maybe I am too paranoid, but I think it's easy to mess up these things > > when working on large folio swapin imo. > > So there is a difference between zswap and this optimization. In this > optimization, if the zeromap is set for all the folio bits, then we > should do large folio swapin. There still needs to be a change in Barrys > patch in alloc_swap_folio, but apart from that does the below diff over > v3 make it better? I will send a v4 with this if it sounds good. > > > diff --git a/mm/page_io.c b/mm/page_io.c > index 6400be6e4291..bf01364748a9 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio > *folio) > } > } > > -static bool swap_zeromap_folio_test(struct folio *folio) > +/* > + * Return the index of the first subpage which is not zero-filled > + * according to swap_info_struct->zeromap. > + * If all pages are zero-filled according to zeromap, it will return > + * folio_nr_pages(folio). > + */ > +static long swap_zeromap_folio_test(struct folio *folio) > { > struct swap_info_struct *sis = swp_swap_info(folio->swap); > swp_entry_t entry; > - unsigned int i; > + long i; Why long? > > for (i = 0; i < folio_nr_pages(folio); i++) { > entry = page_swap_entry(folio_page(folio, i)); > if (!test_bit(swp_offset(entry), sis->zeromap)) > - return false; > + return i; > } > - return true; > + return i; > } > > /* > @@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool > synchronous, > { > struct swap_info_struct *sis = swp_swap_info(folio->swap); > bool workingset = folio_test_workingset(folio); > + long first_non_zero_page_idx; > unsigned long pflags; > bool in_thrashing; > > @@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool > synchronous, > psi_memstall_enter(&pflags); > } > delayacct_swapin_start(); > - if (swap_zeromap_folio_test(folio)) { > + first_non_zero_page_idx = swap_zeromap_folio_test(folio); > + if (first_non_zero_page_idx == folio_nr_pages(folio)) { > folio_zero_fill(folio); > folio_mark_uptodate(folio); > folio_unlock(folio); > + } else if (first_non_zero_page_idx != 0) { > + /* > + * The case for when only *some* of subpages being > swapped-in were recorded > + * in sis->zeromap, while the rest are in zswap/disk is > currently not handled. > + * WARN in this case and return without marking the > folio uptodate so that > + * an IO error is emitted (e.g. do_swap_page() will sigbus). 
> + */ > + WARN_ON_ONCE(1); > } else if (zswap_load(folio)) { > folio_mark_uptodate(folio); > folio_unlock(folio); > > This is too much noise for swap_read_folio(). How about adding swap_read_folio_zeromap() that takes care of this and decides whether or not to call folio_mark_uptodate()? -static bool swap_zeromap_folio_test(struct folio *folio) +/* + * Return the index of the first subpage which is not zero-filled according to + * swap_info_struct->zeromap. If all pages are zero-filled according to + * zeromap, it will return folio_nr_pages(folio). + */ +static unsigned int swap_zeromap_folio_test(struct folio *folio) { struct swap_info_struct *sis = swp_swap_info(folio->swap); swp_entry_t entry; @@ -243,9 +248,9 @@ static bool swap_zeromap_folio_test(struct folio *folio) for (i = 0; i < folio_nr_pages(folio); i++) { entry = page_swap_entry(folio_page(folio, i)); if (!test_bit(swp_offset(entry), sis->zeromap)) - return false; + return i; } - return true; + return i; } /* @@ -511,6 +516,25 @@ static void sio_read_complete(struct kiocb *iocb, long ret) mempool_free(sio, sio_pool); } +static bool swap_read_folio_zeromap(struct folio *folio) +{ + unsigned int idx = swap_zeromap_folio_test(folio); + + if (idx == 0) + return false; + + /* + * Swapping in a large folio that is partially in the zeromap is not + * currently handled. Return true without marking the folio uptodate so + * that an IO error is emitted (e.g. do_swap_page() will sigbus). + */ + if (WARN_ON_ONCE(idx < folio_nr_pages(folio))) + return true; + + folio_zero_fill(folio); + folio_mark_uptodate(folio); + return true +} + static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug) { struct swap_info_struct *sis = swp_swap_info(folio->swap); @@ -600,9 +624,7 @@ void swap_read_folio(struct folio *folio, bool synchronous, psi_memstall_enter(&pflags); } delayacct_swapin_start(); - if (swap_zeromap_folio_test(folio)) { - folio_zero_fill(folio); - folio_mark_uptodate(folio); + if (swap_read_folio_zeromap(folio)) { folio_unlock(folio); } else if (zswap_load(folio)) { folio_mark_uptodate(folio);
On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote: > > Approximately 10-20% of pages to be swapped out are zero pages [1]. > Rather than reading/writing these pages to flash resulting > in increased I/O and flash wear, a bitmap can be used to mark these > pages as zero at write time, and the pages can be filled at > read time if the bit corresponding to the page is set. > With this patch, NVMe writes in Meta server fleet decreased > by almost 10% with conventional swap setup (zswap disabled). > > [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ > > Signed-off-by: Usama Arif <usamaarif642@gmail.com> > --- > include/linux/swap.h | 1 + > mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++- > mm/swapfile.c | 21 +++++++++- > 3 files changed, 111 insertions(+), 3 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index a11c75e897ec..e88563978441 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -299,6 +299,7 @@ struct swap_info_struct { > signed char type; /* strange name for an index */ > unsigned int max; /* extent of the swap_map */ > unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > struct swap_cluster_list free_clusters; /* free clusters list */ > unsigned int lowest_bit; /* index of first free in swap_map */ > diff --git a/mm/page_io.c b/mm/page_io.c > index a360857cf75d..2cac1e11fb85 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis, > goto out; > } > > +static bool is_folio_page_zero_filled(struct folio *folio, int i) > +{ > + unsigned long *data; > + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1; > + bool ret = false; > + > + data = kmap_local_folio(folio, i * PAGE_SIZE); > + if (data[last_pos]) > + goto out; > + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) { > + if (data[pos]) > + goto out; > + } > + ret = true; > +out: > + kunmap_local(data); > + return ret; > +} > + > +static bool is_folio_zero_filled(struct folio *folio) > +{ > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) { > + if (!is_folio_page_zero_filled(folio, i)) > + return false; > + } > + return true; > +} > + > +static void folio_zero_fill(struct folio *folio) > +{ > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) > + clear_highpage(folio_page(folio, i)); > +} > + > +static void swap_zeromap_folio_set(struct folio *folio) > +{ > + struct swap_info_struct *sis = swp_swap_info(folio->swap); > + swp_entry_t entry; > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) { > + entry = page_swap_entry(folio_page(folio, i)); > + set_bit(swp_offset(entry), sis->zeromap); > + } > +} > + > +static void swap_zeromap_folio_clear(struct folio *folio) > +{ > + struct swap_info_struct *sis = swp_swap_info(folio->swap); > + swp_entry_t entry; > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) { > + entry = page_swap_entry(folio_page(folio, i)); > + clear_bit(swp_offset(entry), sis->zeromap); > + } > +} > + > +static bool swap_zeromap_folio_test(struct folio *folio) > +{ > + struct swap_info_struct *sis = swp_swap_info(folio->swap); > + swp_entry_t entry; > + unsigned int i; > + > + for (i = 0; i < folio_nr_pages(folio); i++) { > + entry = page_swap_entry(folio_page(folio, 
i)); > + if (!test_bit(swp_offset(entry), sis->zeromap)) > + return false; > + } > + return true; > +} > + > /* > * We may have stale swap cache pages in memory: notice > * them here and get rid of the unnecessary final write. > @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) > folio_unlock(folio); > return ret; > } > + > + if (is_folio_zero_filled(folio)) { > + swap_zeromap_folio_set(folio); > + folio_start_writeback(folio); > + folio_unlock(folio); > + folio_end_writeback(folio); > + return 0; > + } > + swap_zeromap_folio_clear(folio); > if (zswap_store(folio)) { > folio_start_writeback(folio); > folio_unlock(folio); > @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous, > psi_memstall_enter(&pflags); > } > delayacct_swapin_start(); > - > - if (zswap_load(folio)) { > + if (swap_zeromap_folio_test(folio)) { > + folio_zero_fill(folio); > + folio_mark_uptodate(folio); > + folio_unlock(folio); > + } else if (zswap_load(folio)) { > folio_mark_uptodate(folio); > folio_unlock(folio); > } else if (data_race(sis->flags & SWP_FS_OPS)) { > diff --git a/mm/swapfile.c b/mm/swapfile.c > index f1e559e216bd..90451174fe34 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list, > static void swap_cluster_schedule_discard(struct swap_info_struct *si, > unsigned int idx) > { > + unsigned int i; > + > /* > * If scan_swap_map_slots() can't find a free cluster, it will check > * si->swap_map directly. To make sure the discarding cluster isn't > @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > */ > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > SWAP_MAP_BAD, SWAPFILE_CLUSTER); > + /* > + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio() > + * call on other slots, hence use atomic clear_bit for zeromap instead of the > + * non-atomic bitmap_clear. > + */ > + for (i = 0; i < SWAPFILE_CLUSTER; i++) > + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); > > cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); > > @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > static void swap_do_scheduled_discard(struct swap_info_struct *si) > { > struct swap_cluster_info *info, *ci; > - unsigned int idx; > + unsigned int idx, i; > > info = si->cluster_info; > > @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si) > __free_cluster(si, idx); > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > 0, SWAPFILE_CLUSTER); > + for (i = 0; i < SWAPFILE_CLUSTER; i++) > + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); > unlock_cluster(ci); > } > } > @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) > count = p->swap_map[offset]; > VM_BUG_ON(count != SWAP_HAS_CACHE); > p->swap_map[offset] = 0; > + clear_bit(offset, p->zeromap); Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if we can have a problem, where: 1. The swap entry has its zeromap bit set, and is freed to the swap slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is reclaimed from the swap cache, and all the processes referring to it are terminated, which decrements the swap count to 0 (swap_free() -> __swap_entry_free() -> free_swap_slots()) 2. The swap slot is then re-used in swap space allocation (add_to_swap()) - its zeromap bit is never cleared. 3. 
swap_writepage() writes that non-zero page to swap 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for the entry is set, so it populates a zero page for it. zswap in the past had to invalidate these leftover entries quite carefully. Chengming then moved the invalidation point to free_swap_slot(), massively simplifying the logic. I wonder if we need to do the same here?
On 11/06/2024 18:51, Yosry Ahmed wrote: > [..] >>>> I think its better to handle this in Barrys patch. I feel this series is >>>> close to its final state, i.e. the only diff I have for the next >>>> revision is below to remove start/end_writeback for zer_filled case. I >>>> will comment on Barrys patch once the I send out the next revision of this. >>> Sorry I did not make myself clearer. I did not mean that you should >>> handle the large folio swapin here. This needs to be handled at a >>> higher level because as you mentioned, a large folio may be partially >>> in the zeromap, zswap, swapcache, disk, etc. >>> >>> What I meant is that we should probably have a debug check to make >>> sure this doesn't go unhandled. For zswap, I am trying to add a >>> warning and fail the swapin operation if a large folio slips through >>> to zswap. We can do something similar here if folks agree this is the >>> right way in the interim: >>> https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/. >>> >>> Maybe I am too paranoid, but I think it's easy to mess up these things >>> when working on large folio swapin imo. >> So there is a difference between zswap and this optimization. In this >> optimization, if the zeromap is set for all the folio bits, then we >> should do large folio swapin. There still needs to be a change in Barrys >> patch in alloc_swap_folio, but apart from that does the below diff over >> v3 make it better? I will send a v4 with this if it sounds good. >> >> >> diff --git a/mm/page_io.c b/mm/page_io.c >> index 6400be6e4291..bf01364748a9 100644 >> --- a/mm/page_io.c >> +++ b/mm/page_io.c >> @@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio >> *folio) >> } >> } >> >> -static bool swap_zeromap_folio_test(struct folio *folio) >> +/* >> + * Return the index of the first subpage which is not zero-filled >> + * according to swap_info_struct->zeromap. >> + * If all pages are zero-filled according to zeromap, it will return >> + * folio_nr_pages(folio). >> + */ >> +static long swap_zeromap_folio_test(struct folio *folio) >> { >> struct swap_info_struct *sis = swp_swap_info(folio->swap); >> swp_entry_t entry; >> - unsigned int i; >> + long i; > Why long? folio_nr_pages returns long, but I just checked that folio->_folio_nr_pages is unsigned int, but that will probably be typecasted to long :). 
I will switch to unsigned int as its not really going to go to long for CONFIG_64BIT >> for (i = 0; i < folio_nr_pages(folio); i++) { >> entry = page_swap_entry(folio_page(folio, i)); >> if (!test_bit(swp_offset(entry), sis->zeromap)) >> - return false; >> + return i; >> } >> - return true; >> + return i; >> } >> >> /* >> @@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool >> synchronous, >> { >> struct swap_info_struct *sis = swp_swap_info(folio->swap); >> bool workingset = folio_test_workingset(folio); >> + long first_non_zero_page_idx; >> unsigned long pflags; >> bool in_thrashing; >> >> @@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool >> synchronous, >> psi_memstall_enter(&pflags); >> } >> delayacct_swapin_start(); >> - if (swap_zeromap_folio_test(folio)) { >> + first_non_zero_page_idx = swap_zeromap_folio_test(folio); >> + if (first_non_zero_page_idx == folio_nr_pages(folio)) { >> folio_zero_fill(folio); >> folio_mark_uptodate(folio); >> folio_unlock(folio); >> + } else if (first_non_zero_page_idx != 0) { >> + /* >> + * The case for when only *some* of subpages being >> swapped-in were recorded >> + * in sis->zeromap, while the rest are in zswap/disk is >> currently not handled. >> + * WARN in this case and return without marking the >> folio uptodate so that >> + * an IO error is emitted (e.g. do_swap_page() will sigbus). >> + */ >> + WARN_ON_ONCE(1); >> } else if (zswap_load(folio)) { >> folio_mark_uptodate(folio); >> folio_unlock(folio); >> >> > This is too much noise for swap_read_folio(). How about adding > swap_read_folio_zeromap() that takes care of this and decides whether > or not to call folio_mark_uptodate()? Sounds good, will do as below. Thanks! > > -static bool swap_zeromap_folio_test(struct folio *folio) > +/* > + * Return the index of the first subpage which is not zero-filled according to > + * swap_info_struct->zeromap. If all pages are zero-filled according to > + * zeromap, it will return folio_nr_pages(folio). > + */ > +static unsigned int swap_zeromap_folio_test(struct folio *folio) > { > struct swap_info_struct *sis = swp_swap_info(folio->swap); > swp_entry_t entry; > @@ -243,9 +248,9 @@ static bool swap_zeromap_folio_test(struct folio *folio) > for (i = 0; i < folio_nr_pages(folio); i++) { > entry = page_swap_entry(folio_page(folio, i)); > if (!test_bit(swp_offset(entry), sis->zeromap)) > - return false; > + return i; > } > - return true; > + return i; > } > > /* > @@ -511,6 +516,25 @@ static void sio_read_complete(struct kiocb *iocb, long ret) > mempool_free(sio, sio_pool); > } > > +static bool swap_read_folio_zeromap(struct folio *folio) > +{ > + unsigned int idx = swap_zeromap_folio_test(folio); > + > + if (idx == 0) > + return false; > + > + /* > + * Swapping in a large folio that is partially in the zeromap is not > + * currently handled. Return true without marking the folio uptodate so > + * that an IO error is emitted (e.g. do_swap_page() will sigbus). 
> + */ > + if (WARN_ON_ONCE(idx < folio_nr_pages(folio))) > + return true; > + > + folio_zero_fill(folio); > + folio_mark_uptodate(folio); > + return true; > +} > + > static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug) > { > struct swap_info_struct *sis = swp_swap_info(folio->swap); > @@ -600,9 +624,7 @@ void swap_read_folio(struct folio *folio, bool synchronous, > psi_memstall_enter(&pflags); > } > delayacct_swapin_start(); > - if (swap_zeromap_folio_test(folio)) { > - folio_zero_fill(folio); > - folio_mark_uptodate(folio); > + if (swap_read_folio_zeromap(folio)) { > folio_unlock(folio); > } else if (zswap_load(folio)) { > folio_mark_uptodate(folio);
[..] > > @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) > > count = p->swap_map[offset]; > > VM_BUG_ON(count != SWAP_HAS_CACHE); > > p->swap_map[offset] = 0; > > + clear_bit(offset, p->zeromap); > > Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if > we can have a problem, where: > > 1. The swap entry has its zeromap bit set, and is freed to the swap > slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is > reclaimed from the swap cache, and all the processes referring to it > are terminated, which decrements the swap count to 0 (swap_free() -> > __swap_entry_free() -> free_swap_slots()) > > 2. The swap slot is then re-used in swap space allocation > (add_to_swap()) - its zeromap bit is never cleared. I do not think this can happen before swap_entry_free() is called. Note that when a swap entry is freed to the swap slot cache in free_swap_slot(), it is added to cache->slots_ret, not cache->slots. The former are swap entries cached to be later freed using swap_entry_free(). > > 3. swap_writepage() writes that non-zero page to swap > > 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for > the entry is set, so populates a zero page for it. > > zswap in the past has to carefully invalidate these leftover entries > quite carefully. Chengming then move the invalidation point to > free_swap_slot(), massively simplifying the logic. I think the main benefit of moving the invalidation point was avoiding leaving the compressed page in zswap until swap_entry_free() is called, which will happen only when the swap slot caches are drained. > > I wonder if we need to do the same here?
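To make the lifecycle described above easier to follow, here is a condensed, illustrative sketch of the free path (paraphrased from a reading of the v6.9-era mm/swap_slots.c, not the literal source: locking and the cache-deactivation rechecks are omitted, and the exact zswap call is reduced to a comment since only its placement matters here):

void free_swap_slot(swp_entry_t entry)
{
	struct swap_slots_cache *cache = raw_cpu_ptr(&swp_slots);

	/*
	 * This is where Chengming moved zswap invalidation, so stale
	 * compressed copies are dropped as soon as the last swap count
	 * goes away.
	 */

	if (likely(use_swap_slot_cache && cache->slots_ret)) {
		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
			/*
			 * Batched return: ends up in swap_entry_free(),
			 * which zeroes swap_map[offset] and, with this
			 * series, clears the zeromap bit.
			 */
			swapcache_free_entries(cache->slots_ret, cache->n_ret);
			cache->n_ret = 0;
		}
		/*
		 * Parked in slots_ret: swap_map still holds SWAP_HAS_CACHE
		 * (see the VM_BUG_ON in swap_entry_free() above), so the
		 * slot cannot be re-allocated yet.
		 */
		cache->slots_ret[cache->n_ret++] = entry;
	} else {
		/* Direct free: swap_entry_free() runs before any reuse. */
		swapcache_free_entries(&entry, 1);
	}
}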
On 11/06/2024 19:39, Nhat Pham wrote: > On Mon, Jun 10, 2024 at 5:18 AM Usama Arif <usamaarif642@gmail.com> wrote: >> Approximately 10-20% of pages to be swapped out are zero pages [1]. >> Rather than reading/writing these pages to flash resulting >> in increased I/O and flash wear, a bitmap can be used to mark these >> pages as zero at write time, and the pages can be filled at >> read time if the bit corresponding to the page is set. >> With this patch, NVMe writes in Meta server fleet decreased >> by almost 10% with conventional swap setup (zswap disabled). >> >> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ >> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com> >> --- >> include/linux/swap.h | 1 + >> mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++- >> mm/swapfile.c | 21 +++++++++- >> 3 files changed, 111 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/swap.h b/include/linux/swap.h >> index a11c75e897ec..e88563978441 100644 >> --- a/include/linux/swap.h >> +++ b/include/linux/swap.h >> @@ -299,6 +299,7 @@ struct swap_info_struct { >> signed char type; /* strange name for an index */ >> unsigned int max; /* extent of the swap_map */ >> unsigned char *swap_map; /* vmalloc'ed array of usage counts */ >> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ >> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ >> struct swap_cluster_list free_clusters; /* free clusters list */ >> unsigned int lowest_bit; /* index of first free in swap_map */ >> diff --git a/mm/page_io.c b/mm/page_io.c >> index a360857cf75d..2cac1e11fb85 100644 >> --- a/mm/page_io.c >> +++ b/mm/page_io.c >> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis, >> goto out; >> } >> >> +static bool is_folio_page_zero_filled(struct folio *folio, int i) >> +{ >> + unsigned long *data; >> + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1; >> + bool ret = false; >> + >> + data = kmap_local_folio(folio, i * PAGE_SIZE); >> + if (data[last_pos]) >> + goto out; >> + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) { >> + if (data[pos]) >> + goto out; >> + } >> + ret = true; >> +out: >> + kunmap_local(data); >> + return ret; >> +} >> + >> +static bool is_folio_zero_filled(struct folio *folio) >> +{ >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) { >> + if (!is_folio_page_zero_filled(folio, i)) >> + return false; >> + } >> + return true; >> +} >> + >> +static void folio_zero_fill(struct folio *folio) >> +{ >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) >> + clear_highpage(folio_page(folio, i)); >> +} >> + >> +static void swap_zeromap_folio_set(struct folio *folio) >> +{ >> + struct swap_info_struct *sis = swp_swap_info(folio->swap); >> + swp_entry_t entry; >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) { >> + entry = page_swap_entry(folio_page(folio, i)); >> + set_bit(swp_offset(entry), sis->zeromap); >> + } >> +} >> + >> +static void swap_zeromap_folio_clear(struct folio *folio) >> +{ >> + struct swap_info_struct *sis = swp_swap_info(folio->swap); >> + swp_entry_t entry; >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) { >> + entry = page_swap_entry(folio_page(folio, i)); >> + clear_bit(swp_offset(entry), sis->zeromap); >> + } >> +} >> + >> +static bool swap_zeromap_folio_test(struct folio *folio) >> +{ >> + struct swap_info_struct *sis = swp_swap_info(folio->swap); >> 
+ swp_entry_t entry; >> + unsigned int i; >> + >> + for (i = 0; i < folio_nr_pages(folio); i++) { >> + entry = page_swap_entry(folio_page(folio, i)); >> + if (!test_bit(swp_offset(entry), sis->zeromap)) >> + return false; >> + } >> + return true; >> +} >> + >> /* >> * We may have stale swap cache pages in memory: notice >> * them here and get rid of the unnecessary final write. >> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) >> folio_unlock(folio); >> return ret; >> } >> + >> + if (is_folio_zero_filled(folio)) { >> + swap_zeromap_folio_set(folio); >> + folio_start_writeback(folio); >> + folio_unlock(folio); >> + folio_end_writeback(folio); >> + return 0; >> + } >> + swap_zeromap_folio_clear(folio); >> if (zswap_store(folio)) { >> folio_start_writeback(folio); >> folio_unlock(folio); >> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous, >> psi_memstall_enter(&pflags); >> } >> delayacct_swapin_start(); >> - >> - if (zswap_load(folio)) { >> + if (swap_zeromap_folio_test(folio)) { >> + folio_zero_fill(folio); >> + folio_mark_uptodate(folio); >> + folio_unlock(folio); >> + } else if (zswap_load(folio)) { >> folio_mark_uptodate(folio); >> folio_unlock(folio); >> } else if (data_race(sis->flags & SWP_FS_OPS)) { >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index f1e559e216bd..90451174fe34 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list, >> static void swap_cluster_schedule_discard(struct swap_info_struct *si, >> unsigned int idx) >> { >> + unsigned int i; >> + >> /* >> * If scan_swap_map_slots() can't find a free cluster, it will check >> * si->swap_map directly. To make sure the discarding cluster isn't >> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, >> */ >> memset(si->swap_map + idx * SWAPFILE_CLUSTER, >> SWAP_MAP_BAD, SWAPFILE_CLUSTER); >> + /* >> + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio() >> + * call on other slots, hence use atomic clear_bit for zeromap instead of the >> + * non-atomic bitmap_clear. >> + */ >> + for (i = 0; i < SWAPFILE_CLUSTER; i++) >> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); >> >> cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); >> >> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) >> static void swap_do_scheduled_discard(struct swap_info_struct *si) >> { >> struct swap_cluster_info *info, *ci; >> - unsigned int idx; >> + unsigned int idx, i; >> >> info = si->cluster_info; >> >> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si) >> __free_cluster(si, idx); >> memset(si->swap_map + idx * SWAPFILE_CLUSTER, >> 0, SWAPFILE_CLUSTER); >> + for (i = 0; i < SWAPFILE_CLUSTER; i++) >> + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); >> unlock_cluster(ci); >> } >> } >> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) >> count = p->swap_map[offset]; >> VM_BUG_ON(count != SWAP_HAS_CACHE); >> p->swap_map[offset] = 0; >> + clear_bit(offset, p->zeromap); > Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if > we can have a problem, where: > > 1. The swap entry has its zeromap bit set, and is freed to the swap > slot cache (free_swap_slot() in mm/swap_slots.c). 
For instance, it is > reclaimed from the swap cache, and all the processes referring to it > are terminated, which decrements the swap count to 0 (swap_free() -> > __swap_entry_free() -> free_swap_slots()) > > 2. The swap slot is then re-used in swap space allocation > (add_to_swap()) - its zeromap bit is never cleared. > > 3. swap_writepage() writes that non-zero page to swap In swap_writepage, with this patch you have: if (is_folio_zero_filled(folio)) { swap_zeromap_folio_set(folio); folio_unlock(folio); return 0; } swap_zeromap_folio_clear(folio); i.e. if folio is not zero filled, swap_zeromap_folio_clear will be called and the bit is cleared, so I think it would take care of this scenario? swap_read_folio will see the bit cleared in step 4. > 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for > the entry is set, so populates a zero page for it. > > zswap in the past has to carefully invalidate these leftover entries > quite carefully. Chengming then move the invalidation point to > free_swap_slot(), massively simplifying the logic. > > I wonder if we need to do the same here?
On Tue, Jun 11, 2024 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > [..] > > > @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) > > > count = p->swap_map[offset]; > > > VM_BUG_ON(count != SWAP_HAS_CACHE); > > > p->swap_map[offset] = 0; > > > + clear_bit(offset, p->zeromap); > > > > Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if > > we can have a problem, where: > > > > 1. The swap entry has its zeromap bit set, and is freed to the swap > > slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is > > reclaimed from the swap cache, and all the processes referring to it > > are terminated, which decrements the swap count to 0 (swap_free() -> > > __swap_entry_free() -> free_swap_slots()) > > > > 2. The swap slot is then re-used in swap space allocation > > (add_to_swap()) - its zeromap bit is never cleared. > > I do not think this can happen before swap_entry_free() is called. > Note that when a swap entry is freed to the swap slot cache in > free_swap_slot(), it is added to cache->slots_ret, not cache->slots. > The former are swap entries cached to be later freed using > swap_entry_free(). Ahhh I see. Good point. Then yeah this should be safe from this POV. > > > > > 3. swap_writepage() writes that non-zero page to swap > > > > 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for > > the entry is set, so populates a zero page for it. > > > > zswap in the past has to carefully invalidate these leftover entries > > quite carefully. Chengming then move the invalidation point to > > free_swap_slot(), massively simplifying the logic. > > I think the main benefit of moving the invalidation point was avoiding > leaving the compressed page in zswap until swap_entry_free() is > called, which will happen only when the swap slot caches are drained. > This is true. In this case yeah there's probably not much difference between clearing the bit here vs in swap_entry_free(). > > > > I wonder if we need to do the same here?
On Tue, Jun 11, 2024 at 11:50 AM Usama Arif <usamaarif642@gmail.com> wrote: > In swap_writepage, with this patch you have: > > if (is_folio_zero_filled(folio)) { > swap_zeromap_folio_set(folio); > folio_unlock(folio); > return 0; > } > swap_zeromap_folio_clear(folio); > I was concerned with the swap slot being freed and reused, without ever being read :) But it looks like it has to be properly reset before being reused, so all is well on that front. What about the put_swap_folio() -> swap_free_cluster() case - do we need to handle zeromap bit clearing here too? Looks like it's clearing the swap_map (i.e. returning it directly to the swapfile, allowing those slots to be reused) here, and I notice that you clear the zeromap bitmap wherever the swap_map is cleared as well :) I jumped around the code a bit - in free_cluster() (called by swap_free_cluster()), there's this chunk: if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == (SWP_WRITEOK | SWP_PAGE_DISCARD)) { swap_cluster_schedule_discard(si, idx); return; } swap_cluster_schedule_discard() does clear_bit() on the zeromap for the entire cluster. We also clear_bit() in the work function swap_do_scheduled_discard() (is this redundant?). But what if this check is false, i.e. the swap device does not have the SWP_PAGE_DISCARD flag set? Are we not clearing the bits in the zeromap here?
On 11/06/2024 20:33, Nhat Pham wrote: > On Tue, Jun 11, 2024 at 11:50 AM Usama Arif <usamaarif642@gmail.com> wrote: >> In swap_writepage, with this patch you have: >> >> if (is_folio_zero_filled(folio)) { >> swap_zeromap_folio_set(folio); >> folio_unlock(folio); >> return 0; >> } >> swap_zeromap_folio_clear(folio); >> > I was concerned with the swap slot being freed and reused, without > ever being read :) But looks like it has to be properly reset before > being reused, so all is well on that front. > > What about the put_swap_folio() -> swap_free_cluster() case - do we > need to handle zeromap bit clearing here too? Looks like it's clearing > the swap_map (i.e returning it directly to the swapfile, allowing > those slots to be reused) here, and I notice that you clear the > zeromap bitmap wherever the swap_map is cleared as well :) > > I jumped around the code a bit - in free_cluster() (called by > swap_free_cluster()), there's this chunk: > > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == > (SWP_WRITEOK | SWP_PAGE_DISCARD)) { > swap_cluster_schedule_discard(si, idx); > return; > } > > swap_cluster_schedule_discard() does clear_bit() on the zeromap on the > entire cluster. We also clear_bit() in the work function > swap_do_scheduled_discard() (is this redundant?). > > But what if this check is false, i.e the swap device does not have the > SWP_PAGE_DISCARD flag set? Are we not clearing the bits in the zeromap > here? Yes, should add in swap_free_cluster as well, will do in next revision. Thanks!
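For completeness, here is a hypothetical sketch of what that next-revision change could look like (this is not taken from any posted version of the series; the body of swap_free_cluster() is abbreviated to the lines relevant here, and it simply mirrors the clearing already done in swap_cluster_schedule_discard()):

static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
{
	unsigned long offset = idx * SWAPFILE_CLUSTER;
	unsigned int i;

	/* ... cluster locking and accounting elided ... */
	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
	/*
	 * When the whole cluster is returned for reuse without going
	 * through the discard path (no SWP_PAGE_DISCARD), the zeromap
	 * bits for its slots must be cleared as well.
	 */
	for (i = 0; i < SWAPFILE_CLUSTER; i++)
		clear_bit(offset + i, si->zeromap);
	/* ... free_cluster() / unlock / swap_range_free() elided ... */
}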
diff --git a/include/linux/swap.h b/include/linux/swap.h index a11c75e897ec..e88563978441 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -299,6 +299,7 @@ struct swap_info_struct { signed char type; /* strange name for an index */ unsigned int max; /* extent of the swap_map */ unsigned char *swap_map; /* vmalloc'ed array of usage counts */ + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ struct swap_cluster_list free_clusters; /* free clusters list */ unsigned int lowest_bit; /* index of first free in swap_map */ diff --git a/mm/page_io.c b/mm/page_io.c index a360857cf75d..2cac1e11fb85 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis, goto out; } +static bool is_folio_page_zero_filled(struct folio *folio, int i) +{ + unsigned long *data; + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1; + bool ret = false; + + data = kmap_local_folio(folio, i * PAGE_SIZE); + if (data[last_pos]) + goto out; + for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) { + if (data[pos]) + goto out; + } + ret = true; +out: + kunmap_local(data); + return ret; +} + +static bool is_folio_zero_filled(struct folio *folio) +{ + unsigned int i; + + for (i = 0; i < folio_nr_pages(folio); i++) { + if (!is_folio_page_zero_filled(folio, i)) + return false; + } + return true; +} + +static void folio_zero_fill(struct folio *folio) +{ + unsigned int i; + + for (i = 0; i < folio_nr_pages(folio); i++) + clear_highpage(folio_page(folio, i)); +} + +static void swap_zeromap_folio_set(struct folio *folio) +{ + struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_entry_t entry; + unsigned int i; + + for (i = 0; i < folio_nr_pages(folio); i++) { + entry = page_swap_entry(folio_page(folio, i)); + set_bit(swp_offset(entry), sis->zeromap); + } +} + +static void swap_zeromap_folio_clear(struct folio *folio) +{ + struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_entry_t entry; + unsigned int i; + + for (i = 0; i < folio_nr_pages(folio); i++) { + entry = page_swap_entry(folio_page(folio, i)); + clear_bit(swp_offset(entry), sis->zeromap); + } +} + +static bool swap_zeromap_folio_test(struct folio *folio) +{ + struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_entry_t entry; + unsigned int i; + + for (i = 0; i < folio_nr_pages(folio); i++) { + entry = page_swap_entry(folio_page(folio, i)); + if (!test_bit(swp_offset(entry), sis->zeromap)) + return false; + } + return true; +} + /* * We may have stale swap cache pages in memory: notice * them here and get rid of the unnecessary final write. 
@@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) folio_unlock(folio); return ret; } + + if (is_folio_zero_filled(folio)) { + swap_zeromap_folio_set(folio); + folio_start_writeback(folio); + folio_unlock(folio); + folio_end_writeback(folio); + return 0; + } + swap_zeromap_folio_clear(folio); if (zswap_store(folio)) { folio_start_writeback(folio); folio_unlock(folio); @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous, psi_memstall_enter(&pflags); } delayacct_swapin_start(); - - if (zswap_load(folio)) { + if (swap_zeromap_folio_test(folio)) { + folio_zero_fill(folio); + folio_mark_uptodate(folio); + folio_unlock(folio); + } else if (zswap_load(folio)) { folio_mark_uptodate(folio); folio_unlock(folio); } else if (data_race(sis->flags & SWP_FS_OPS)) { diff --git a/mm/swapfile.c b/mm/swapfile.c index f1e559e216bd..90451174fe34 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list, static void swap_cluster_schedule_discard(struct swap_info_struct *si, unsigned int idx) { + unsigned int i; + /* * If scan_swap_map_slots() can't find a free cluster, it will check * si->swap_map directly. To make sure the discarding cluster isn't @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, */ memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); + /* + * zeromap can see updates from concurrent swap_writepage() and swap_read_folio() + * call on other slots, hence use atomic clear_bit for zeromap instead of the + * non-atomic bitmap_clear. + */ + for (i = 0; i < SWAPFILE_CLUSTER; i++) + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) static void swap_do_scheduled_discard(struct swap_info_struct *si) { struct swap_cluster_info *info, *ci; - unsigned int idx; + unsigned int idx, i; info = si->cluster_info; @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si) __free_cluster(si, idx); memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); + for (i = 0; i < SWAPFILE_CLUSTER; i++) + clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap); unlock_cluster(ci); } } @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) count = p->swap_map[offset]; VM_BUG_ON(count != SWAP_HAS_CACHE); p->swap_map[offset] = 0; + clear_bit(offset, p->zeromap); dec_cluster_info_page(p, p->cluster_info, offset); unlock_cluster(ci); @@ -2597,6 +2609,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) free_percpu(p->cluster_next_cpu); p->cluster_next_cpu = NULL; vfree(swap_map); + bitmap_free(p->zeromap); kvfree(cluster_info); /* Destroy swap account information */ swap_cgroup_swapoff(p->type); @@ -3123,6 +3136,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) goto bad_swap_unlock_inode; } + p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL); + if (!p->zeromap) { + error = -ENOMEM; + goto bad_swap_unlock_inode; + } + if (p->bdev && bdev_stable_writes(p->bdev)) p->flags |= SWP_STABLE_WRITES;
Approximately 10-20% of pages to be swapped out are zero pages [1]. Rather than reading/writing these pages to flash, resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set. With this patch, NVMe writes in the Meta server fleet decreased by almost 10% with a conventional swap setup (zswap disabled). [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ Signed-off-by: Usama Arif <usamaarif642@gmail.com> --- include/linux/swap.h | 1 + mm/page_io.c | 92 +++++++++++++++++++++++++++++++++++++++++++- mm/swapfile.c | 21 +++++++++- 3 files changed, 111 insertions(+), 3 deletions(-)
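As a rough sense of the cost of the bitmap itself (an illustrative back-of-the-envelope figure, not something measured in this series): the zeromap allocated at swapon time uses one bit per swap slot, so a hypothetical 64 GiB swap device with 4 KiB pages has 64 GiB / 4 KiB = 16,777,216 slots, and the bitmap therefore takes 16,777,216 / 8 bytes = 2 MiB, i.e. about 0.003% of the swap space it tracks.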