Message ID | 20241018105026.2521366-2-usamaarif642@gmail.com
---|---
State | New
Series | [RFC,1/4] mm/zswap: skip swapcache for swapping in zswap pages
On Fri, Oct 18, 2024 at 3:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> As mentioned in [1], there is a significant improvement in no
> readahead swapin performance for super fast devices when skipping
> swapcache.

FYI, Kairui was working on removing the swapcache bypass completely,
which I think may be a good thing:
https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/

However, that series is old, since before the large folio swapin
support, so I am not sure if/when he intends to refresh it.

In his approach there is still a swapin path for synchronous swapin
though, which we can still utilize for zswap.

[..]
> +static bool swp_offset_in_zswap(unsigned int type, pgoff_t offset)
> +{
> +	return (offset >> SWAP_ADDRESS_SPACE_SHIFT) < nr_zswap_trees[type];

I am not sure I understand what we are looking for here. When does
this return false? Aren't the zswap trees always allocated during
swapon?

> +}
> +
> +/* Returns true if the entire folio is in zswap */

There isn't really a folio at this point, maybe "Returns true if the
entire range is in zswap"?

Also, this is racy because an exclusive load, invalidation, or
writeback can cause an entry to be removed from zswap. Under what
conditions is this safe? The caller can probably guarantee we don't
race against invalidation, but can we guarantee that concurrent
exclusive loads or writebacks don't happen?

If the answer is yes, this needs to be properly documented.

> +bool zswap_present_test(swp_entry_t swp, int nr_pages)
> +{
> +	pgoff_t offset = swp_offset(swp), tree_max_idx;
> +	int max_idx = 0, i = 0, tree_offset = 0;
> +	unsigned int type = swp_type(swp);
> +	struct zswap_entry *entry = NULL;
> +	struct xarray *tree;
> +
> +	while (i < nr_pages) {
> +		tree_offset = offset + i;
> +		/* Check if the tree exists. */
> +		if (!swp_offset_in_zswap(type, tree_offset))
> +			return false;
> +
> +		tree = swap_zswap_tree(swp_entry(type, tree_offset));
> +		XA_STATE(xas, tree, tree_offset);

Please do not mix declarations with code.

> +
> +		tree_max_idx = tree_offset % SWAP_ADDRESS_SPACE_PAGES ?
> +			ALIGN(tree_offset, SWAP_ADDRESS_SPACE_PAGES) :
> +			ALIGN(tree_offset + 1, SWAP_ADDRESS_SPACE_PAGES);

Does this work if we always use ALIGN(tree_offset + 1,
SWAP_ADDRESS_SPACE_PAGES)?

> +		max_idx = min(offset + nr_pages, tree_max_idx) - 1;
> +		rcu_read_lock();
> +		xas_for_each(&xas, entry, max_idx) {
> +			if (xas_retry(&xas, entry))
> +				continue;
> +			i++;
> +		}
> +		rcu_read_unlock();
> +		/*
> +		 * If xas_for_each exits because entry is NULL and

nit: add () to the end of function names (i.e. xas_for_each())

> +		 * the number of entries checked are less then max idx,

s/then/than

> +		 * then zswap does not contain the entire folio.
> +		 */
> +		if (!entry && offset + i <= max_idx)
> +			return false;
> +	}
> +
> +	return true;
> +}
[..]
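The ALIGN() equivalence questioned above can be sanity-checked with a small userspace sketch. This is illustrative only: the ALIGN() macro below mirrors the kernel's power-of-two round-up, and SWAP_ADDRESS_SPACE_PAGES is assumed to be 1 << 14.

#include <assert.h>
#include <stdio.h>

/* Assumed values/definitions for the sketch, not taken from kernel headers. */
#define SWAP_ADDRESS_SPACE_PAGES	(1UL << 14)
#define ALIGN(x, a)			(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long off;

	for (off = 0; off < 4 * SWAP_ADDRESS_SPACE_PAGES; off++) {
		/* The conditional form used in the patch... */
		unsigned long cond = off % SWAP_ADDRESS_SPACE_PAGES ?
			ALIGN(off, SWAP_ADDRESS_SPACE_PAGES) :
			ALIGN(off + 1, SWAP_ADDRESS_SPACE_PAGES);

		/* ...matches the unconditional form suggested in the review. */
		assert(cond == ALIGN(off + 1, SWAP_ADDRESS_SPACE_PAGES));
	}
	printf("ALIGN(off + 1, ...) is equivalent for all tested offsets\n");
	return 0;
}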
On Fri, Oct 18, 2024 at 3:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
[..]
> +static bool swp_offset_in_zswap(unsigned int type, pgoff_t offset)
> +{
> +	return (offset >> SWAP_ADDRESS_SPACE_SHIFT) < nr_zswap_trees[type];
> +}
> +
> +/* Returns true if the entire folio is in zswap */
> +bool zswap_present_test(swp_entry_t swp, int nr_pages)

Also, did you check how the performance changes if we bring back the
bitmap of present entries (i.e. what used to be frontswap's bitmap)
instead of the tree lookups here?

[..]
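For reference, a rough sketch of what such a present-entries bitmap lookup could look like. This is purely hypothetical: zswap_present_map and zswap_present_bitmap_test() below do not exist in zswap; keeping the bits in sync in zswap_store(), zswap_invalidate() and writeback is the real work and is not shown.

/* Hypothetical sketch only: no such bitmap exists in zswap today. */
static unsigned long *zswap_present_map[MAX_SWAPFILES];

/* Bits would be set in zswap_store() and cleared on invalidate/writeback/load. */
static bool zswap_present_bitmap_test(swp_entry_t swp, int nr_pages)
{
	unsigned long *map = zswap_present_map[swp_type(swp)];
	pgoff_t offset = swp_offset(swp);
	int i;

	for (i = 0; i < nr_pages; i++)
		if (!test_bit(offset + i, map))
			return false;
	return true;
}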
On 21/10/2024 22:09, Yosry Ahmed wrote:
> On Fri, Oct 18, 2024 at 3:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
[..]
>> +static bool swp_offset_in_zswap(unsigned int type, pgoff_t offset)
>> +{
>> +	return (offset >> SWAP_ADDRESS_SPACE_SHIFT) < nr_zswap_trees[type];
>
> I am not sure I understand what we are looking for here. When does
> this return false? Aren't the zswap trees always allocated during
> swapon?
>

Hi Yosry,

Thanks for the review!

It becomes useful in patch 3 when trying to determine if a large folio
can be allocated.

For e.g. if the swap entry is the last entry of the last tree, and 1M
folios are enabled (nr_pages = 256), then the while loop in
zswap_present_test will try to access a tree that doesn't exist from
the 2nd 4K page onwards if we dont have this check in
zswap_present_test.

>> +}
>> +
>> +/* Returns true if the entire folio is in zswap */
>
> There isn't really a folio at this point, maybe "Returns true if the
> entire range is in zswap"?

Will change, Thanks!

>
> Also, this is racy because an exclusive load, invalidation, or
> writeback can cause an entry to be removed from zswap. Under what
> conditions is this safe? The caller can probably guarantee we don't
> race against invalidation, but can we guarantee that concurrent
> exclusive loads or writebacks don't happen?
>
> If the answer is yes, this needs to be properly documented.

swapcache_prepare should stop things from becoming racy.

lets say trying to swapin a mTHP of size 32 pages:
- T1 is doing do_swap_page, T2 is doing zswap_writeback.
- T1 - Check if the entire 32 pages is in zswap, swapcache_prepare(entry, nr_pages)
  in do_swap_page is not yet called.
- T2 - zswap_writeback_entry starts and lets say writes page 2 to swap. it calls
  __read_swap_cache_async -> swapcache_prepare increments swap_map count, writes
  page to swap.
- T1 - swapcache_prepare is then called and fails and then there will be a
  pagefault again for it.

I will try and document this better.

>
>> +
>> +		tree_max_idx = tree_offset % SWAP_ADDRESS_SPACE_PAGES ?
>> +			ALIGN(tree_offset, SWAP_ADDRESS_SPACE_PAGES) :
>> +			ALIGN(tree_offset + 1, SWAP_ADDRESS_SPACE_PAGES);
>
> Does this work if we always use ALIGN(tree_offset + 1,
> SWAP_ADDRESS_SPACE_PAGES)?

Yes, I think
max_idx = min(offset + nr_pages, ALIGN(tree_offset + 1, SWAP_ADDRESS_SPACE_PAGES)) - 1;
will work. I will test it out, Thanks!

[..]
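To make the tree-bounds arithmetic in the reply above concrete, here is an illustration with assumed numbers (SWAP_ADDRESS_SPACE_SHIFT taken as 14, a swapfile with exactly two zswap trees). It is a sketch of the scenario being described, not code from the series.

/*
 * Illustration only, with made-up numbers: each zswap tree covers
 * 1 << 14 == 16384 swap slots, and nr_zswap_trees[type] == 2, so valid
 * offsets are 0..32767.  The faulting entry is the very last slot.
 */
pgoff_t offset = 32767;		/* last slot of the last tree */
int nr_pages = 256;		/* 1M mTHP swapin */
int i;

for (i = 0; i < nr_pages; i++) {
	pgoff_t tree_idx = (offset + i) >> SWAP_ADDRESS_SPACE_SHIFT;

	/*
	 * i == 0: tree_idx == 1, a valid tree.
	 * i == 1: tree_idx == 2, already >= nr_zswap_trees[type], so the
	 * walk would index past the array of trees without the check.
	 */
	if (tree_idx >= 2 /* nr_zswap_trees[type] */)
		break;		/* this is what swp_offset_in_zswap() catches */
}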
On 21/10/2024 22:11, Yosry Ahmed wrote:
> On Fri, Oct 18, 2024 at 3:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
[..]
>> +/* Returns true if the entire folio is in zswap */
>> +bool zswap_present_test(swp_entry_t swp, int nr_pages)
>
> Also, did you check how the performance changes if we bring back the
> bitmap of present entries (i.e. what used to be frontswap's bitmap)
> instead of the tree lookups here?
>

I think the cost of tree lookup is not much and compared to
zswap_decompress can probably be ignored. zswap_present_test is
essentially just xa_load for the first entry, and then xas_next_entry
for subsequent entries which is even cheaper than xa_load.
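A rough sketch of the access pattern being described, assuming the whole range falls inside a single tree (the real zswap_present_test() also handles tree boundaries); the function name is illustrative.

/* Illustrative sketch, not the patch itself. */
static bool range_fully_in_zswap(struct xarray *tree, pgoff_t first, pgoff_t last)
{
	XA_STATE(xas, tree, first);	/* one lookup positions the cursor */
	struct zswap_entry *entry;
	pgoff_t found = 0;

	rcu_read_lock();
	/* xas_for_each() advances via xas_next_entry(), no per-page xa_load() */
	xas_for_each(&xas, entry, last) {
		if (xas_retry(&xas, entry))
			continue;
		found++;
	}
	rcu_read_unlock();

	/* every slot in [first, last] had an entry iff the count matches */
	return found == last - first + 1;
}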
[..]
> > I am not sure I understand what we are looking for here. When does
> > this return false? Aren't the zswap trees always allocated during
> > swapon?
> >
>
> Hi Yosry,
>
> Thanks for the review!
>
> It becomes useful in patch 3 when trying to determine if a large folio can be allocated.
>
> For e.g. if the swap entry is the last entry of the last tree, and 1M folios are enabled
> (nr_pages = 256), then the while loop in zswap_present_test will try to access a tree
> that doesn't exist from the 2nd 4K page onwards if we dont have this check in
> zswap_present_test.

Doesn't swap_pte_batch() make sure that the range of swap entries
passed here all corresponds to existing swap entries, and those
entries should always have a corresponding zswap tree? How can the
passed in range contain an entry that is not in any zswap tree?

I feel like I am missing something.

[..]
> > Also, this is racy because an exclusive load, invalidation, or
> > writeback can cause an entry to be removed from zswap. Under what
> > conditions is this safe? The caller can probably guarantee we don't
> > race against invalidation, but can we guarantee that concurrent
> > exclusive loads or writebacks don't happen?
> >
> > If the answer is yes, this needs to be properly documented.
>
> swapcache_prepare should stop things from becoming racy.
>
> lets say trying to swapin a mTHP of size 32 pages:
> - T1 is doing do_swap_page, T2 is doing zswap_writeback.
> - T1 - Check if the entire 32 pages is in zswap, swapcache_prepare(entry, nr_pages) in do_swap_page is not yet called.
> - T2 - zswap_writeback_entry starts and lets say writes page 2 to swap. it calls __read_swap_cache_async -> swapcache_prepare increments swap_map count, writes page to swap.

Can the folio be dropped from the swapcache at this point (e.g. by
reclaim)? If yes, it seems like swapcache_prepare() can succeed and
try to read it from zswap.

> - T1 - swapcache_prepare is then called and fails and then there will be a pagefault again for it.
>
> I will try and document this better.

We need to establish the rules for zswap_present_test() to not be
racy, document them at the definition, establish the safety of racy
callers (i.e. can_swapin_thp()), and document them at the call sites.

[..]
> > Does this work if we always use ALIGN(tree_offset + 1,
> > SWAP_ADDRESS_SPACE_PAGES)?
>
> Yes, I think max_idx = min(offset + nr_pages, ALIGN(tree_offset + 1, SWAP_ADDRESS_SPACE_PAGES)) - 1;
> will work. I will test it out, Thanks!

Might need to split it over two lines.
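For reference, the simplified computation split over two lines might look like the following sketch (untested, matching the suggestion above).

		tree_max_idx = ALIGN(tree_offset + 1, SWAP_ADDRESS_SPACE_PAGES);
		max_idx = min(offset + nr_pages, tree_max_idx) - 1;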
[..]
> > Also, did you check how the performance changes if we bring back the
> > bitmap of present entries (i.e. what used to be frontswap's bitmap)
> > instead of the tree lookups here?
> >
>
> I think the cost of tree lookup is not much and compared to zswap_decompress
> can probably be ignored. zswap_present_test is essentially just xa_load for
> the first entry, and then xas_next_entry for subsequent entries which is even
> cheaper than xa_load.

Maybe it's worth measuring if it's not too much work. IIUC there is a
regression that we don't fully understand with this series, and the
extra lookup may be contributing to that. I think it could be just
fine, but I can't tell without numbers :)
On Tue, Oct 22, 2024 at 5:46 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > swapcache_prepare should stop things from becoming racy.
> >
> > lets say trying to swapin a mTHP of size 32 pages:
> > - T1 is doing do_swap_page, T2 is doing zswap_writeback.
> > - T1 - Check if the entire 32 pages is in zswap, swapcache_prepare(entry, nr_pages) in do_swap_page is not yet called.
> > - T2 - zswap_writeback_entry starts and lets say writes page 2 to swap. it calls __read_swap_cache_async -> swapcache_prepare increments swap_map count, writes page to swap.
>
> Can the folio be dropped from the swapcache at this point (e.g. by
> reclaim)? If yes, it seems like swapcache_prepare() can succeed and
> try to read it from zswap.
>

I think you're onto something.

Can this happen: say T2 writebacks a couple of tail pages, but not all
of them, then drops everything from swap cache. Then T1 can definitely
proceed. It would then check again in zswap_load(), which returns
false (thanks to the zswap_present() test).

All fine and good so far, but then in swap_read_folio(), it would try
to fall back to reading the entire large folio from swapfile, which
will contain bogus data in pages that have not been written back by
T2.

I think the problem is swap_read_folio() assumes it always succeeds,
and a precondition for successful reading is the large folio must have
no mixed backing state for its subpages, which we fail to guarantee
before entering swap_read_folio(). This can lead to memory corruption.

Buuut, I think all we need to do is just check the backing state again
after T1's swapcache_prepare() call. At this point, we are guaranteed
to have a stable backing state. If it fails here, then we can just
exit and fall back to individual page swapping in.
On Fri, Oct 25, 2024 at 11:19 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
[..]
> Buuut, I think all we need to do is just check the backing state again
> after T1's swapcache_prepare() call. At this point, we are guaranteed
> to have a stable backing state. If it fails here, then we can just
> exit and fall back to individual page swapping in.

I think this should work, but we need to take a closer look for other
things that can go wrong along this path.
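A minimal sketch of where such a re-check could sit in the skip-swapcache path of do_swap_page(). The nr_pages forms of swapcache_prepare()/swapcache_clear(), the swapin_from_zswap flag and the fallback label are assumptions based on this series, not mainline code.

	/*
	 * Sketch only: once swapcache_prepare() has succeeded for the whole
	 * range, concurrent writeback or exclusive loads can no longer change
	 * the backing state under us.
	 */
	if (swapcache_prepare(entry, nr_pages)) {
		/* raced with another swapin/writeback, retry the fault */
		schedule_timeout_uninterruptible(1);
		goto out_page;
	}

	/* Re-check now that the backing state is stable. */
	if (zswap_present_test(entry, nr_pages) != swapin_from_zswap) {
		/*
		 * Mixed backing state: give up on the large folio and fall
		 * back to swapping the pages in individually.
		 */
		swapcache_clear(si, entry, nr_pages);
		goto fallback;
	}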
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index d961ead91bf1..e418d75db738 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -27,6 +27,7 @@ struct zswap_lruvec_state {
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
+bool zswap_present_test(swp_entry_t swp, int nr_pages);
 void zswap_invalidate(swp_entry_t swp);
 int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
@@ -49,6 +50,11 @@ static inline bool zswap_load(struct folio *folio)
 	return false;
 }
 
+static inline bool zswap_present_test(swp_entry_t swp, int nr_pages)
+{
+	return false;
+}
+
 static inline void zswap_invalidate(swp_entry_t swp) {}
 static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 03e5452dd0c0..49d243131169 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4289,7 +4289,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		swapcache = folio;
 
 	if (!folio) {
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
+		if ((data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
+		     zswap_present_test(entry, 1)) &&
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */
 			folio = alloc_swap_folio(vmf);
diff --git a/mm/page_io.c b/mm/page_io.c
index 4aa34862676f..2a15b197968a 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -602,7 +602,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 	unsigned long pflags;
 	bool in_thrashing;
 
-	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
 
diff --git a/mm/zswap.c b/mm/zswap.c
index 7f00cc918e7c..f4b03071b2fb 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1576,6 +1576,52 @@ bool zswap_store(struct folio *folio)
 	return ret;
 }
 
+static bool swp_offset_in_zswap(unsigned int type, pgoff_t offset)
+{
+	return (offset >> SWAP_ADDRESS_SPACE_SHIFT) < nr_zswap_trees[type];
+}
+
+/* Returns true if the entire folio is in zswap */
+bool zswap_present_test(swp_entry_t swp, int nr_pages)
+{
+	pgoff_t offset = swp_offset(swp), tree_max_idx;
+	int max_idx = 0, i = 0, tree_offset = 0;
+	unsigned int type = swp_type(swp);
+	struct zswap_entry *entry = NULL;
+	struct xarray *tree;
+
+	while (i < nr_pages) {
+		tree_offset = offset + i;
+		/* Check if the tree exists. */
+		if (!swp_offset_in_zswap(type, tree_offset))
+			return false;
+
+		tree = swap_zswap_tree(swp_entry(type, tree_offset));
+		XA_STATE(xas, tree, tree_offset);
+
+		tree_max_idx = tree_offset % SWAP_ADDRESS_SPACE_PAGES ?
+			ALIGN(tree_offset, SWAP_ADDRESS_SPACE_PAGES) :
+			ALIGN(tree_offset + 1, SWAP_ADDRESS_SPACE_PAGES);
+		max_idx = min(offset + nr_pages, tree_max_idx) - 1;
+		rcu_read_lock();
+		xas_for_each(&xas, entry, max_idx) {
+			if (xas_retry(&xas, entry))
+				continue;
+			i++;
+		}
+		rcu_read_unlock();
+		/*
+		 * If xas_for_each exits because entry is NULL and
+		 * the number of entries checked are less then max idx,
+		 * then zswap does not contain the entire folio.
+		 */
+		if (!entry && offset + i <= max_idx)
+			return false;
+	}
+
+	return true;
+}
+
 bool zswap_load(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
As mentioned in [1], there is a significant improvement in no
readahead swapin performance for super fast devices when skipping
swapcache.

With large folio zswapin support added in later patches, this path
will also act as "readahead" by swapping in multiple pages into large
folios, further improving performance.

[1] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#m5a792a04dfea20eb7af4c355d00503efe1c86a93

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/zswap.h |  6 ++++++
 mm/memory.c           |  3 ++-
 mm/page_io.c          |  1 -
 mm/zswap.c            | 46 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 54 insertions(+), 2 deletions(-)