[RFC,3/4] mm/zswap: add support for large folio zswapin

Message ID 20241018105026.2521366-4-usamaarif642@gmail.com (mailing list archive)
State New
Headers show
Series [RFC,1/4] mm/zswap: skip swapcache for swapping in zswap pages | expand

Commit Message

Usama Arif Oct. 18, 2024, 10:48 a.m. UTC
At the time of folio allocation, alloc_swap_folio checks if the entire
folio is in zswap to determine the folio order.
During swap_read_folio, zswap_load will check if the entire folio
is in zswap, and if it is, it will iterate through the pages in the
folio and decompress them.
This means the benefits of large folios (fewer page faults, batched
PTE and rmap manipulation, reduced lru list, TLB coalescing for arm64
and amd) are not lost at swap out when using zswap.
This patch does not add support for hybrid backends (i.e. folios
partly present in swap and zswap).

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 mm/memory.c | 13 +++-------
 mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
 2 files changed, 34 insertions(+), 47 deletions(-)

Comments

Barry Song Oct. 21, 2024, 5:49 a.m. UTC | #1
On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
> At time of folio allocation, alloc_swap_folio checks if the entire
> folio is in zswap to determine folio order.
> During swap_read_folio, zswap_load will check if the entire folio
> is in zswap, and if it is, it will iterate through the pages in
> folio and decompress them.
> This will mean the benefits of large folios (fewer page faults, batched
> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> and amd) are not lost at swap out when using zswap.
> This patch does not add support for hybrid backends (i.e. folios
> partly present swap and zswap).
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  mm/memory.c | 13 +++-------
>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
>  2 files changed, 34 insertions(+), 47 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 49d243131169..75f7b9f5fb32 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>
>         /*
>          * swap_read_folio() can't handle the case a large folio is hybridly
> -        * from different backends. And they are likely corner cases. Similar
> -        * things might be added once zswap support large folios.
> +        * from different backends. And they are likely corner cases.
>          */
>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
>                 return false;
>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
>                 return false;
> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> +               return false;
>
>         return true;
>  }
> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>         if (unlikely(userfaultfd_armed(vma)))
>                 goto fallback;
>
> -       /*
> -        * A large swapped out folio could be partially or fully in zswap. We
> -        * lack handling for such cases, so fallback to swapping in order-0
> -        * folio.
> -        */
> -       if (!zswap_never_enabled())
> -               goto fallback;
> -
>         entry = pte_to_swp_entry(vmf->orig_pte);
>         /*
>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 9cc91ae31116..a5aa86c24060 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
>
>  bool zswap_load(struct folio *folio)
>  {
> +       int nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
> +       unsigned int type = swp_type(swp);
>         pgoff_t offset = swp_offset(swp);
>         bool swapcache = folio_test_swapcache(folio);
> -       struct xarray *tree = swap_zswap_tree(swp);
> +       struct xarray *tree;
>         struct zswap_entry *entry;
> +       int i;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
>         if (zswap_never_enabled())
>                 return false;
>
> -       /*
> -        * Large folios should not be swapped in while zswap is being used, as
> -        * they are not properly handled. Zswap does not properly load large
> -        * folios, and a large folio may only be partially in zswap.
> -        *
> -        * Return true without marking the folio uptodate so that an IO error is
> -        * emitted (e.g. do_swap_page() will sigbus).
> -        */
> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> -               return true;
> -
> -       /*
> -        * When reading into the swapcache, invalidate our entry. The
> -        * swapcache can be the authoritative owner of the page and
> -        * its mappings, and the pressure that results from having two
> -        * in-memory copies outweighs any benefits of caching the
> -        * compression work.
> -        *
> -        * (Most swapins go through the swapcache. The notable
> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> -        * files, which reads into a private page and may free it if
> -        * the fault fails. We remain the primary owner of the entry.)
> -        */
> -       if (swapcache)
> -               entry = xa_erase(tree, offset);
> -       else
> -               entry = xa_load(tree, offset);
> -
> -       if (!entry)
> +       if (!zswap_present_test(folio->swap, nr_pages))
>                 return false;

Hi Usama,

Is there any chance that zswap_present_test() returns true
in do_swap_page() but false in zswap_load()? If that’s
possible, could we be missing something? For example,
could it be that zswap has been partially released (with
part of it still present) during an mTHP swap-in?

If this happens with an mTHP, my understanding is that
we shouldn't proceed with reading corrupted data from the
disk backend.

>
> -       zswap_decompress(entry, &folio->page);
> +       for (i = 0; i < nr_pages; ++i) {
> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
> +               /*
> +                * When reading into the swapcache, invalidate our entry. The
> +                * swapcache can be the authoritative owner of the page and
> +                * its mappings, and the pressure that results from having two
> +                * in-memory copies outweighs any benefits of caching the
> +                * compression work.
> +                *
> +                * (Swapins with swap count > 1 go through the swapcache.
> +                * For swap count == 1, the swapcache is skipped and we
> +                * remain the primary owner of the entry.)
> +                */
> +               if (swapcache)
> +                       entry = xa_erase(tree, offset + i);
> +               else
> +                       entry = xa_load(tree, offset + i);
>
> -       count_vm_event(ZSWPIN);
> -       if (entry->objcg)
> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
> +               zswap_decompress(entry, folio_page(folio, i));
>
> -       if (swapcache) {
> -               zswap_entry_free(entry);
> -               folio_mark_dirty(folio);
> +               if (entry->objcg)
> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
> +               if (swapcache)
> +                       zswap_entry_free(entry);
>         }
>
> +       count_vm_events(ZSWPIN, nr_pages);
> +       if (swapcache)
> +               folio_mark_dirty(folio);
> +
>         folio_mark_uptodate(folio);
>         return true;
>  }
> --
> 2.43.5
>

Thanks
barry
Usama Arif Oct. 21, 2024, 10:44 a.m. UTC | #2
On 21/10/2024 06:49, Barry Song wrote:
> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> At time of folio allocation, alloc_swap_folio checks if the entire
>> folio is in zswap to determine folio order.
>> During swap_read_folio, zswap_load will check if the entire folio
>> is in zswap, and if it is, it will iterate through the pages in
>> folio and decompress them.
>> This will mean the benefits of large folios (fewer page faults, batched
>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
>> and amd) are not lost at swap out when using zswap.
>> This patch does not add support for hybrid backends (i.e. folios
>> partly present swap and zswap).
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>  mm/memory.c | 13 +++-------
>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
>>  2 files changed, 34 insertions(+), 47 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 49d243131169..75f7b9f5fb32 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>>
>>         /*
>>          * swap_read_folio() can't handle the case a large folio is hybridly
>> -        * from different backends. And they are likely corner cases. Similar
>> -        * things might be added once zswap support large folios.
>> +        * from different backends. And they are likely corner cases.
>>          */
>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
>>                 return false;
>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
>>                 return false;
>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
>> +               return false;
>>
>>         return true;
>>  }
>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>         if (unlikely(userfaultfd_armed(vma)))
>>                 goto fallback;
>>
>> -       /*
>> -        * A large swapped out folio could be partially or fully in zswap. We
>> -        * lack handling for such cases, so fallback to swapping in order-0
>> -        * folio.
>> -        */
>> -       if (!zswap_never_enabled())
>> -               goto fallback;
>> -
>>         entry = pte_to_swp_entry(vmf->orig_pte);
>>         /*
>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 9cc91ae31116..a5aa86c24060 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
>>
>>  bool zswap_load(struct folio *folio)
>>  {
>> +       int nr_pages = folio_nr_pages(folio);
>>         swp_entry_t swp = folio->swap;
>> +       unsigned int type = swp_type(swp);
>>         pgoff_t offset = swp_offset(swp);
>>         bool swapcache = folio_test_swapcache(folio);
>> -       struct xarray *tree = swap_zswap_tree(swp);
>> +       struct xarray *tree;
>>         struct zswap_entry *entry;
>> +       int i;
>>
>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>
>>         if (zswap_never_enabled())
>>                 return false;
>>
>> -       /*
>> -        * Large folios should not be swapped in while zswap is being used, as
>> -        * they are not properly handled. Zswap does not properly load large
>> -        * folios, and a large folio may only be partially in zswap.
>> -        *
>> -        * Return true without marking the folio uptodate so that an IO error is
>> -        * emitted (e.g. do_swap_page() will sigbus).
>> -        */
>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
>> -               return true;
>> -
>> -       /*
>> -        * When reading into the swapcache, invalidate our entry. The
>> -        * swapcache can be the authoritative owner of the page and
>> -        * its mappings, and the pressure that results from having two
>> -        * in-memory copies outweighs any benefits of caching the
>> -        * compression work.
>> -        *
>> -        * (Most swapins go through the swapcache. The notable
>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
>> -        * files, which reads into a private page and may free it if
>> -        * the fault fails. We remain the primary owner of the entry.)
>> -        */
>> -       if (swapcache)
>> -               entry = xa_erase(tree, offset);
>> -       else
>> -               entry = xa_load(tree, offset);
>> -
>> -       if (!entry)
>> +       if (!zswap_present_test(folio->swap, nr_pages))
>>                 return false;
> 
> Hi Usama,
> 
> Is there any chance that zswap_present_test() returns true
> in do_swap_page() but false in zswap_load()? If that’s
> possible, could we be missing something? For example,
> could it be that zswap has been partially released (with
> part of it still present) during an mTHP swap-in?
> 
> If this happens with an mTHP, my understanding is that
> we shouldn't proceed with reading corrupted data from the
> disk backend.
> 

If it's not swapcache, the zswap entry is not deleted, so I think
it should be ok?

We can check over here if the entire folio is in zswap, and if not,
return true without marking the folio uptodate to give an error.
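
For illustration, a rough sketch of that check (hypothetical, not part
of this patch; it reuses the per-entry xarray lookups that zswap_load()
already does, and also lets a fully-absent folio fall back to the
backend, which comes up later in the thread):

        /*
         * Sketch only: if the folio is only partially in zswap, return
         * true without marking it uptodate so the fault path sees an IO
         * error (e.g. do_swap_page() will SIGBUS) instead of reading
         * stale data from the disk backend.
         */
        if (!zswap_present_test(folio->swap, nr_pages)) {
                for (i = 0; i < nr_pages; i++) {
                        tree = swap_zswap_tree(swp_entry(type, offset + i));
                        if (xa_load(tree, offset + i))
                                return true;    /* partial: error out */
                }
                return false;   /* fully absent: fall back to swap */
        }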


>>
>> -       zswap_decompress(entry, &folio->page);
>> +       for (i = 0; i < nr_pages; ++i) {
>> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
>> +               /*
>> +                * When reading into the swapcache, invalidate our entry. The
>> +                * swapcache can be the authoritative owner of the page and
>> +                * its mappings, and the pressure that results from having two
>> +                * in-memory copies outweighs any benefits of caching the
>> +                * compression work.
>> +                *
>> +                * (Swapins with swap count > 1 go through the swapcache.
>> +                * For swap count == 1, the swapcache is skipped and we
>> +                * remain the primary owner of the entry.)
>> +                */
>> +               if (swapcache)
>> +                       entry = xa_erase(tree, offset + i);
>> +               else
>> +                       entry = xa_load(tree, offset + i);
>>
>> -       count_vm_event(ZSWPIN);
>> -       if (entry->objcg)
>> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
>> +               zswap_decompress(entry, folio_page(folio, i));
>>
>> -       if (swapcache) {
>> -               zswap_entry_free(entry);
>> -               folio_mark_dirty(folio);
>> +               if (entry->objcg)
>> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
>> +               if (swapcache)
>> +                       zswap_entry_free(entry);
>>         }
>>
>> +       count_vm_events(ZSWPIN, nr_pages);
>> +       if (swapcache)
>> +               folio_mark_dirty(folio);
>> +
>>         folio_mark_uptodate(folio);
>>         return true;
>>  }
>> --
>> 2.43.5
>>
> 
> Thanks
> barry
Barry Song Oct. 21, 2024, 10:55 a.m. UTC | #3
On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 21/10/2024 06:49, Barry Song wrote:
> > On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >> At time of folio allocation, alloc_swap_folio checks if the entire
> >> folio is in zswap to determine folio order.
> >> During swap_read_folio, zswap_load will check if the entire folio
> >> is in zswap, and if it is, it will iterate through the pages in
> >> folio and decompress them.
> >> This will mean the benefits of large folios (fewer page faults, batched
> >> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> >> and amd) are not lost at swap out when using zswap.
> >> This patch does not add support for hybrid backends (i.e. folios
> >> partly present swap and zswap).
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> ---
> >>  mm/memory.c | 13 +++-------
> >>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
> >>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 49d243131169..75f7b9f5fb32 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>
> >>         /*
> >>          * swap_read_folio() can't handle the case a large folio is hybridly
> >> -        * from different backends. And they are likely corner cases. Similar
> >> -        * things might be added once zswap support large folios.
> >> +        * from different backends. And they are likely corner cases.
> >>          */
> >>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>                 return false;
> >>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>                 return false;
> >> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> >> +               return false;
> >>
> >>         return true;
> >>  }
> >> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>         if (unlikely(userfaultfd_armed(vma)))
> >>                 goto fallback;
> >>
> >> -       /*
> >> -        * A large swapped out folio could be partially or fully in zswap. We
> >> -        * lack handling for such cases, so fallback to swapping in order-0
> >> -        * folio.
> >> -        */
> >> -       if (!zswap_never_enabled())
> >> -               goto fallback;
> >> -
> >>         entry = pte_to_swp_entry(vmf->orig_pte);
> >>         /*
> >>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >> diff --git a/mm/zswap.c b/mm/zswap.c
> >> index 9cc91ae31116..a5aa86c24060 100644
> >> --- a/mm/zswap.c
> >> +++ b/mm/zswap.c
> >> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>
> >>  bool zswap_load(struct folio *folio)
> >>  {
> >> +       int nr_pages = folio_nr_pages(folio);
> >>         swp_entry_t swp = folio->swap;
> >> +       unsigned int type = swp_type(swp);
> >>         pgoff_t offset = swp_offset(swp);
> >>         bool swapcache = folio_test_swapcache(folio);
> >> -       struct xarray *tree = swap_zswap_tree(swp);
> >> +       struct xarray *tree;
> >>         struct zswap_entry *entry;
> >> +       int i;
> >>
> >>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>
> >>         if (zswap_never_enabled())
> >>                 return false;
> >>
> >> -       /*
> >> -        * Large folios should not be swapped in while zswap is being used, as
> >> -        * they are not properly handled. Zswap does not properly load large
> >> -        * folios, and a large folio may only be partially in zswap.
> >> -        *
> >> -        * Return true without marking the folio uptodate so that an IO error is
> >> -        * emitted (e.g. do_swap_page() will sigbus).
> >> -        */
> >> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> >> -               return true;
> >> -
> >> -       /*
> >> -        * When reading into the swapcache, invalidate our entry. The
> >> -        * swapcache can be the authoritative owner of the page and
> >> -        * its mappings, and the pressure that results from having two
> >> -        * in-memory copies outweighs any benefits of caching the
> >> -        * compression work.
> >> -        *
> >> -        * (Most swapins go through the swapcache. The notable
> >> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >> -        * files, which reads into a private page and may free it if
> >> -        * the fault fails. We remain the primary owner of the entry.)
> >> -        */
> >> -       if (swapcache)
> >> -               entry = xa_erase(tree, offset);
> >> -       else
> >> -               entry = xa_load(tree, offset);
> >> -
> >> -       if (!entry)
> >> +       if (!zswap_present_test(folio->swap, nr_pages))
> >>                 return false;
> >
> > Hi Usama,
> >
> > Is there any chance that zswap_present_test() returns true
> > in do_swap_page() but false in zswap_load()? If that’s
> > possible, could we be missing something? For example,
> > could it be that zswap has been partially released (with
> > part of it still present) during an mTHP swap-in?
> >
> > If this happens with an mTHP, my understanding is that
> > we shouldn't proceed with reading corrupted data from the
> > disk backend.
> >
>
> If its not swapcache, the zswap entry is not deleted so I think
> it should be ok?
>
> We can check over here if the entire folio is in zswap,
> and if not, return true without marking the folio uptodate
> to give an error.

We have swapcache_prepare() called in do_swap_page(), which should
have protected these entries from being partially freed by other processes
(for example, if someone falls back to small folios for the same address).
Therefore, I believe that zswap_present_test() cannot be false for mTHP in
the current case where only synchronous I/O is supported.

the below might help detect the bug?

if (!zswap_present_test(folio->swap, nr_pages)) {
        if (WARN_ON_ONCE(nr_pages > 1))
                return true;
        return false;
}

the code seems quite ugly :-) do we have some way to unify the code
for large and small folios?

not quite sure about shmem though....

>
>
> >>
> >> -       zswap_decompress(entry, &folio->page);
> >> +       for (i = 0; i < nr_pages; ++i) {
> >> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
> >> +               /*
> >> +                * When reading into the swapcache, invalidate our entry. The
> >> +                * swapcache can be the authoritative owner of the page and
> >> +                * its mappings, and the pressure that results from having two
> >> +                * in-memory copies outweighs any benefits of caching the
> >> +                * compression work.
> >> +                *
> >> +                * (Swapins with swap count > 1 go through the swapcache.
> >> +                * For swap count == 1, the swapcache is skipped and we
> >> +                * remain the primary owner of the entry.)
> >> +                */
> >> +               if (swapcache)
> >> +                       entry = xa_erase(tree, offset + i);
> >> +               else
> >> +                       entry = xa_load(tree, offset + i);
> >>
> >> -       count_vm_event(ZSWPIN);
> >> -       if (entry->objcg)
> >> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
> >> +               zswap_decompress(entry, folio_page(folio, i));
> >>
> >> -       if (swapcache) {
> >> -               zswap_entry_free(entry);
> >> -               folio_mark_dirty(folio);
> >> +               if (entry->objcg)
> >> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
> >> +               if (swapcache)
> >> +                       zswap_entry_free(entry);
> >>         }
> >>
> >> +       count_vm_events(ZSWPIN, nr_pages);
> >> +       if (swapcache)
> >> +               folio_mark_dirty(folio);
> >> +
> >>         folio_mark_uptodate(folio);
> >>         return true;
> >>  }
> >> --
> >> 2.43.5
> >>
> >

Thanks
barry
Usama Arif Oct. 21, 2024, 12:21 p.m. UTC | #4
On 21/10/2024 11:55, Barry Song wrote:
> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 21/10/2024 06:49, Barry Song wrote:
>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>> At time of folio allocation, alloc_swap_folio checks if the entire
>>>> folio is in zswap to determine folio order.
>>>> During swap_read_folio, zswap_load will check if the entire folio
>>>> is in zswap, and if it is, it will iterate through the pages in
>>>> folio and decompress them.
>>>> This will mean the benefits of large folios (fewer page faults, batched
>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
>>>> and amd) are not lost at swap out when using zswap.
>>>> This patch does not add support for hybrid backends (i.e. folios
>>>> partly present swap and zswap).
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>> ---
>>>>  mm/memory.c | 13 +++-------
>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 49d243131169..75f7b9f5fb32 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>>>>
>>>>         /*
>>>>          * swap_read_folio() can't handle the case a large folio is hybridly
>>>> -        * from different backends. And they are likely corner cases. Similar
>>>> -        * things might be added once zswap support large folios.
>>>> +        * from different backends. And they are likely corner cases.
>>>>          */
>>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
>>>>                 return false;
>>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
>>>>                 return false;
>>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
>>>> +               return false;
>>>>
>>>>         return true;
>>>>  }
>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>>         if (unlikely(userfaultfd_armed(vma)))
>>>>                 goto fallback;
>>>>
>>>> -       /*
>>>> -        * A large swapped out folio could be partially or fully in zswap. We
>>>> -        * lack handling for such cases, so fallback to swapping in order-0
>>>> -        * folio.
>>>> -        */
>>>> -       if (!zswap_never_enabled())
>>>> -               goto fallback;
>>>> -
>>>>         entry = pte_to_swp_entry(vmf->orig_pte);
>>>>         /*
>>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>> index 9cc91ae31116..a5aa86c24060 100644
>>>> --- a/mm/zswap.c
>>>> +++ b/mm/zswap.c
>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
>>>>
>>>>  bool zswap_load(struct folio *folio)
>>>>  {
>>>> +       int nr_pages = folio_nr_pages(folio);
>>>>         swp_entry_t swp = folio->swap;
>>>> +       unsigned int type = swp_type(swp);
>>>>         pgoff_t offset = swp_offset(swp);
>>>>         bool swapcache = folio_test_swapcache(folio);
>>>> -       struct xarray *tree = swap_zswap_tree(swp);
>>>> +       struct xarray *tree;
>>>>         struct zswap_entry *entry;
>>>> +       int i;
>>>>
>>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>>
>>>>         if (zswap_never_enabled())
>>>>                 return false;
>>>>
>>>> -       /*
>>>> -        * Large folios should not be swapped in while zswap is being used, as
>>>> -        * they are not properly handled. Zswap does not properly load large
>>>> -        * folios, and a large folio may only be partially in zswap.
>>>> -        *
>>>> -        * Return true without marking the folio uptodate so that an IO error is
>>>> -        * emitted (e.g. do_swap_page() will sigbus).
>>>> -        */
>>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
>>>> -               return true;
>>>> -
>>>> -       /*
>>>> -        * When reading into the swapcache, invalidate our entry. The
>>>> -        * swapcache can be the authoritative owner of the page and
>>>> -        * its mappings, and the pressure that results from having two
>>>> -        * in-memory copies outweighs any benefits of caching the
>>>> -        * compression work.
>>>> -        *
>>>> -        * (Most swapins go through the swapcache. The notable
>>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
>>>> -        * files, which reads into a private page and may free it if
>>>> -        * the fault fails. We remain the primary owner of the entry.)
>>>> -        */
>>>> -       if (swapcache)
>>>> -               entry = xa_erase(tree, offset);
>>>> -       else
>>>> -               entry = xa_load(tree, offset);
>>>> -
>>>> -       if (!entry)
>>>> +       if (!zswap_present_test(folio->swap, nr_pages))
>>>>                 return false;
>>>
>>> Hi Usama,
>>>
>>> Is there any chance that zswap_present_test() returns true
>>> in do_swap_page() but false in zswap_load()? If that’s
>>> possible, could we be missing something? For example,
>>> could it be that zswap has been partially released (with
>>> part of it still present) during an mTHP swap-in?
>>>
>>> If this happens with an mTHP, my understanding is that
>>> we shouldn't proceed with reading corrupted data from the
>>> disk backend.
>>>
>>
>> If its not swapcache, the zswap entry is not deleted so I think
>> it should be ok?
>>
>> We can check over here if the entire folio is in zswap,
>> and if not, return true without marking the folio uptodate
>> to give an error.
> 
> We have swapcache_prepare() called in do_swap_page(), which should
> have protected these entries from being partially freed by other processes
> (for example, if someone falls back to small folios for the same address).
> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> the current case where only synchronous I/O is supported.
> 
> the below might help detect the bug?
> 
> if (!zswap_present_test(folio->swap, nr_pages)) {
>      if (WARN_ON_ONCE(nr_pages > 1))
>                 return true;
>      return false;
> }
> 

I think this isn't correct. If nr_pages > 1 and the entire folio is not in zswap,
it should still return false. So we would need to check the whole folio if we want
to warn. But I think if we are sure the code is ok, it is an unnecessary check.

> the code seems quite ugly :-) do we have some way to unify the code
> for large and small folios?
> 
> not quite sure about shmem though....
> 

If it's shmem and the swap_count goes to 1, I think it's still ok, because
then the folio will be returned by swap_cache_get_folio() if it is already
in the swapcache.

>>
>>
>>>>
>>>> -       zswap_decompress(entry, &folio->page);
>>>> +       for (i = 0; i < nr_pages; ++i) {
>>>> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
>>>> +               /*
>>>> +                * When reading into the swapcache, invalidate our entry. The
>>>> +                * swapcache can be the authoritative owner of the page and
>>>> +                * its mappings, and the pressure that results from having two
>>>> +                * in-memory copies outweighs any benefits of caching the
>>>> +                * compression work.
>>>> +                *
>>>> +                * (Swapins with swap count > 1 go through the swapcache.
>>>> +                * For swap count == 1, the swapcache is skipped and we
>>>> +                * remain the primary owner of the entry.)
>>>> +                */
>>>> +               if (swapcache)
>>>> +                       entry = xa_erase(tree, offset + i);
>>>> +               else
>>>> +                       entry = xa_load(tree, offset + i);
>>>>
>>>> -       count_vm_event(ZSWPIN);
>>>> -       if (entry->objcg)
>>>> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>> +               zswap_decompress(entry, folio_page(folio, i));
>>>>
>>>> -       if (swapcache) {
>>>> -               zswap_entry_free(entry);
>>>> -               folio_mark_dirty(folio);
>>>> +               if (entry->objcg)
>>>> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>> +               if (swapcache)
>>>> +                       zswap_entry_free(entry);
>>>>         }
>>>>
>>>> +       count_vm_events(ZSWPIN, nr_pages);
>>>> +       if (swapcache)
>>>> +               folio_mark_dirty(folio);
>>>> +
>>>>         folio_mark_uptodate(folio);
>>>>         return true;
>>>>  }
>>>> --
>>>> 2.43.5
>>>>
>>>
> 
> Thanks
> barry
Barry Song Oct. 21, 2024, 8:28 p.m. UTC | #5
On Tue, Oct 22, 2024 at 1:21 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 21/10/2024 11:55, Barry Song wrote:
> > On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 21/10/2024 06:49, Barry Song wrote:
> >>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>
> >>>> At time of folio allocation, alloc_swap_folio checks if the entire
> >>>> folio is in zswap to determine folio order.
> >>>> During swap_read_folio, zswap_load will check if the entire folio
> >>>> is in zswap, and if it is, it will iterate through the pages in
> >>>> folio and decompress them.
> >>>> This will mean the benefits of large folios (fewer page faults, batched
> >>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> >>>> and amd) are not lost at swap out when using zswap.
> >>>> This patch does not add support for hybrid backends (i.e. folios
> >>>> partly present swap and zswap).
> >>>>
> >>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >>>> ---
> >>>>  mm/memory.c | 13 +++-------
> >>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
> >>>>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>>>
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index 49d243131169..75f7b9f5fb32 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>>>
> >>>>         /*
> >>>>          * swap_read_folio() can't handle the case a large folio is hybridly
> >>>> -        * from different backends. And they are likely corner cases. Similar
> >>>> -        * things might be added once zswap support large folios.
> >>>> +        * from different backends. And they are likely corner cases.
> >>>>          */
> >>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>>>                 return false;
> >>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>>>                 return false;
> >>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> >>>> +               return false;
> >>>>
> >>>>         return true;
> >>>>  }
> >>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>>>         if (unlikely(userfaultfd_armed(vma)))
> >>>>                 goto fallback;
> >>>>
> >>>> -       /*
> >>>> -        * A large swapped out folio could be partially or fully in zswap. We
> >>>> -        * lack handling for such cases, so fallback to swapping in order-0
> >>>> -        * folio.
> >>>> -        */
> >>>> -       if (!zswap_never_enabled())
> >>>> -               goto fallback;
> >>>> -
> >>>>         entry = pte_to_swp_entry(vmf->orig_pte);
> >>>>         /*
> >>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>>> index 9cc91ae31116..a5aa86c24060 100644
> >>>> --- a/mm/zswap.c
> >>>> +++ b/mm/zswap.c
> >>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>>>
> >>>>  bool zswap_load(struct folio *folio)
> >>>>  {
> >>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>         swp_entry_t swp = folio->swap;
> >>>> +       unsigned int type = swp_type(swp);
> >>>>         pgoff_t offset = swp_offset(swp);
> >>>>         bool swapcache = folio_test_swapcache(folio);
> >>>> -       struct xarray *tree = swap_zswap_tree(swp);
> >>>> +       struct xarray *tree;
> >>>>         struct zswap_entry *entry;
> >>>> +       int i;
> >>>>
> >>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>>>
> >>>>         if (zswap_never_enabled())
> >>>>                 return false;
> >>>>
> >>>> -       /*
> >>>> -        * Large folios should not be swapped in while zswap is being used, as
> >>>> -        * they are not properly handled. Zswap does not properly load large
> >>>> -        * folios, and a large folio may only be partially in zswap.
> >>>> -        *
> >>>> -        * Return true without marking the folio uptodate so that an IO error is
> >>>> -        * emitted (e.g. do_swap_page() will sigbus).
> >>>> -        */
> >>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> >>>> -               return true;
> >>>> -
> >>>> -       /*
> >>>> -        * When reading into the swapcache, invalidate our entry. The
> >>>> -        * swapcache can be the authoritative owner of the page and
> >>>> -        * its mappings, and the pressure that results from having two
> >>>> -        * in-memory copies outweighs any benefits of caching the
> >>>> -        * compression work.
> >>>> -        *
> >>>> -        * (Most swapins go through the swapcache. The notable
> >>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >>>> -        * files, which reads into a private page and may free it if
> >>>> -        * the fault fails. We remain the primary owner of the entry.)
> >>>> -        */
> >>>> -       if (swapcache)
> >>>> -               entry = xa_erase(tree, offset);
> >>>> -       else
> >>>> -               entry = xa_load(tree, offset);
> >>>> -
> >>>> -       if (!entry)
> >>>> +       if (!zswap_present_test(folio->swap, nr_pages))
> >>>>                 return false;
> >>>
> >>> Hi Usama,
> >>>
> >>> Is there any chance that zswap_present_test() returns true
> >>> in do_swap_page() but false in zswap_load()? If that’s
> >>> possible, could we be missing something? For example,
> >>> could it be that zswap has been partially released (with
> >>> part of it still present) during an mTHP swap-in?
> >>>
> >>> If this happens with an mTHP, my understanding is that
> >>> we shouldn't proceed with reading corrupted data from the
> >>> disk backend.
> >>>
> >>
> >> If its not swapcache, the zswap entry is not deleted so I think
> >> it should be ok?
> >>
> >> We can check over here if the entire folio is in zswap,
> >> and if not, return true without marking the folio uptodate
> >> to give an error.
> >
> > We have swapcache_prepare() called in do_swap_page(), which should
> > have protected these entries from being partially freed by other processes
> > (for example, if someone falls back to small folios for the same address).
> > Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> > the current case where only synchronous I/O is supported.
> >
> > the below might help detect the bug?
> >
> > if (!zswap_present_test(folio->swap, nr_pages)) {
> >      if (WARN_ON_ONCE(nr_pages > 1))
> >                 return true;
> >      return false;
> > }
> >
>
> I think this isn't correct. If nr_pages > 1 and the entire folio is not in zswap,
> it should still return false. So would need to check the whole folio if we want to
> warn. But I think if we are sure the code is ok, it is an unnecessary check.

My point is that zswap_present_test() can't differentiate between:
1. the *whole* folio is not in zswap
2. the folio is *partially* not in zswap

In case 2, returning false is wrong.

And when nr_pages > 1, we have already confirmed earlier in
do_swap_page() that zswap_present_test() is true. At this point,
it must always be true; if it's false, it indicates a bug.

>
> > the code seems quite ugly :-) do we have some way to unify the code
> > for large and small folios?
> >
> > not quite sure about shmem though....
> >
>
> If its shmem, and the swap_count goes to 1, I think its still ok? because
> then the folio will be gotten from swap_cache_get_folio if it has already
> been in swapcache.
>
> >>
> >>
> >>>>
> >>>> -       zswap_decompress(entry, &folio->page);
> >>>> +       for (i = 0; i < nr_pages; ++i) {
> >>>> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
> >>>> +               /*
> >>>> +                * When reading into the swapcache, invalidate our entry. The
> >>>> +                * swapcache can be the authoritative owner of the page and
> >>>> +                * its mappings, and the pressure that results from having two
> >>>> +                * in-memory copies outweighs any benefits of caching the
> >>>> +                * compression work.
> >>>> +                *
> >>>> +                * (Swapins with swap count > 1 go through the swapcache.
> >>>> +                * For swap count == 1, the swapcache is skipped and we
> >>>> +                * remain the primary owner of the entry.)
> >>>> +                */
> >>>> +               if (swapcache)
> >>>> +                       entry = xa_erase(tree, offset + i);
> >>>> +               else
> >>>> +                       entry = xa_load(tree, offset + i);
> >>>>
> >>>> -       count_vm_event(ZSWPIN);
> >>>> -       if (entry->objcg)
> >>>> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
> >>>> +               zswap_decompress(entry, folio_page(folio, i));
> >>>>
> >>>> -       if (swapcache) {
> >>>> -               zswap_entry_free(entry);
> >>>> -               folio_mark_dirty(folio);
> >>>> +               if (entry->objcg)
> >>>> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
> >>>> +               if (swapcache)
> >>>> +                       zswap_entry_free(entry);
> >>>>         }
> >>>>
> >>>> +       count_vm_events(ZSWPIN, nr_pages);
> >>>> +       if (swapcache)
> >>>> +               folio_mark_dirty(folio);
> >>>> +
> >>>>         folio_mark_uptodate(folio);
> >>>>         return true;
> >>>>  }
> >>>> --
> >>>> 2.43.5
> >>>>
> >>>
> >
> > Thanks
> > barry
>
Usama Arif Oct. 21, 2024, 8:57 p.m. UTC | #6
On 21/10/2024 21:28, Barry Song wrote:
> On Tue, Oct 22, 2024 at 1:21 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 21/10/2024 11:55, Barry Song wrote:
>>> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 21/10/2024 06:49, Barry Song wrote:
>>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>>>
>>>>>> At time of folio allocation, alloc_swap_folio checks if the entire
>>>>>> folio is in zswap to determine folio order.
>>>>>> During swap_read_folio, zswap_load will check if the entire folio
>>>>>> is in zswap, and if it is, it will iterate through the pages in
>>>>>> folio and decompress them.
>>>>>> This will mean the benefits of large folios (fewer page faults, batched
>>>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
>>>>>> and amd) are not lost at swap out when using zswap.
>>>>>> This patch does not add support for hybrid backends (i.e. folios
>>>>>> partly present swap and zswap).
>>>>>>
>>>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>>>> ---
>>>>>>  mm/memory.c | 13 +++-------
>>>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
>>>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 49d243131169..75f7b9f5fb32 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>>>>>>
>>>>>>         /*
>>>>>>          * swap_read_folio() can't handle the case a large folio is hybridly
>>>>>> -        * from different backends. And they are likely corner cases. Similar
>>>>>> -        * things might be added once zswap support large folios.
>>>>>> +        * from different backends. And they are likely corner cases.
>>>>>>          */
>>>>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
>>>>>>                 return false;
>>>>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
>>>>>>                 return false;
>>>>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
>>>>>> +               return false;
>>>>>>
>>>>>>         return true;
>>>>>>  }
>>>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>>>>         if (unlikely(userfaultfd_armed(vma)))
>>>>>>                 goto fallback;
>>>>>>
>>>>>> -       /*
>>>>>> -        * A large swapped out folio could be partially or fully in zswap. We
>>>>>> -        * lack handling for such cases, so fallback to swapping in order-0
>>>>>> -        * folio.
>>>>>> -        */
>>>>>> -       if (!zswap_never_enabled())
>>>>>> -               goto fallback;
>>>>>> -
>>>>>>         entry = pte_to_swp_entry(vmf->orig_pte);
>>>>>>         /*
>>>>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>>>> index 9cc91ae31116..a5aa86c24060 100644
>>>>>> --- a/mm/zswap.c
>>>>>> +++ b/mm/zswap.c
>>>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
>>>>>>
>>>>>>  bool zswap_load(struct folio *folio)
>>>>>>  {
>>>>>> +       int nr_pages = folio_nr_pages(folio);
>>>>>>         swp_entry_t swp = folio->swap;
>>>>>> +       unsigned int type = swp_type(swp);
>>>>>>         pgoff_t offset = swp_offset(swp);
>>>>>>         bool swapcache = folio_test_swapcache(folio);
>>>>>> -       struct xarray *tree = swap_zswap_tree(swp);
>>>>>> +       struct xarray *tree;
>>>>>>         struct zswap_entry *entry;
>>>>>> +       int i;
>>>>>>
>>>>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>>>>
>>>>>>         if (zswap_never_enabled())
>>>>>>                 return false;
>>>>>>
>>>>>> -       /*
>>>>>> -        * Large folios should not be swapped in while zswap is being used, as
>>>>>> -        * they are not properly handled. Zswap does not properly load large
>>>>>> -        * folios, and a large folio may only be partially in zswap.
>>>>>> -        *
>>>>>> -        * Return true without marking the folio uptodate so that an IO error is
>>>>>> -        * emitted (e.g. do_swap_page() will sigbus).
>>>>>> -        */
>>>>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
>>>>>> -               return true;
>>>>>> -
>>>>>> -       /*
>>>>>> -        * When reading into the swapcache, invalidate our entry. The
>>>>>> -        * swapcache can be the authoritative owner of the page and
>>>>>> -        * its mappings, and the pressure that results from having two
>>>>>> -        * in-memory copies outweighs any benefits of caching the
>>>>>> -        * compression work.
>>>>>> -        *
>>>>>> -        * (Most swapins go through the swapcache. The notable
>>>>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
>>>>>> -        * files, which reads into a private page and may free it if
>>>>>> -        * the fault fails. We remain the primary owner of the entry.)
>>>>>> -        */
>>>>>> -       if (swapcache)
>>>>>> -               entry = xa_erase(tree, offset);
>>>>>> -       else
>>>>>> -               entry = xa_load(tree, offset);
>>>>>> -
>>>>>> -       if (!entry)
>>>>>> +       if (!zswap_present_test(folio->swap, nr_pages))
>>>>>>                 return false;
>>>>>
>>>>> Hi Usama,
>>>>>
>>>>> Is there any chance that zswap_present_test() returns true
>>>>> in do_swap_page() but false in zswap_load()? If that’s
>>>>> possible, could we be missing something? For example,
>>>>> could it be that zswap has been partially released (with
>>>>> part of it still present) during an mTHP swap-in?
>>>>>
>>>>> If this happens with an mTHP, my understanding is that
>>>>> we shouldn't proceed with reading corrupted data from the
>>>>> disk backend.
>>>>>
>>>>
>>>> If its not swapcache, the zswap entry is not deleted so I think
>>>> it should be ok?
>>>>
>>>> We can check over here if the entire folio is in zswap,
>>>> and if not, return true without marking the folio uptodate
>>>> to give an error.
>>>
>>> We have swapcache_prepare() called in do_swap_page(), which should
>>> have protected these entries from being partially freed by other processes
>>> (for example, if someone falls back to small folios for the same address).
>>> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
>>> the current case where only synchronous I/O is supported.
>>>
>>> the below might help detect the bug?
>>>
>>> if (!zswap_present_test(folio->swap, nr_pages)) {
>>>      if (WARN_ON_ONCE(nr_pages > 1))
>>>                 return true;
>>>      return false;
>>> }
>>>
>>
>> I think this isn't correct. If nr_pages > 1 and the entire folio is not in zswap,
>> it should still return false. So would need to check the whole folio if we want to
>> warn. But I think if we are sure the code is ok, it is an unnecessary check.
> 
> my point is that zswap_present_test() can't differentiate
> 1. the *whole* folio is not in zswap
> 2. the folio is *partially* not in zswap
> 
> in case 2, returning false is wrong.
> 

Agreed!

> And when nr_pages > 1, we have already confirmed earlier in
> do_swap_page() that zswap_present_test() is true. At this point,
> it must always be true; if it's false, it indicates a bug.
> 

Yes, agreed! I was thinking just from the zswap_load() perspective, irrespective
of who calls it.
If someone adds large folio support to swapin_readahead, then I think the
above WARN might be an issue.

But just with this patch series, doing what you suggested is correct. I
will add it in the next revision. We can deal with it once swapin with
swap count > 1 starts supporting large folios.

>>
>>> the code seems quite ugly :-) do we have some way to unify the code
>>> for large and small folios?
>>>
>>> not quite sure about shmem though....
>>>
>>
>> If its shmem, and the swap_count goes to 1, I think its still ok? because
>> then the folio will be gotten from swap_cache_get_folio if it has already
>> been in swapcache.
>>
>>>>
>>>>
>>>>>>
>>>>>> -       zswap_decompress(entry, &folio->page);
>>>>>> +       for (i = 0; i < nr_pages; ++i) {
>>>>>> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
>>>>>> +               /*
>>>>>> +                * When reading into the swapcache, invalidate our entry. The
>>>>>> +                * swapcache can be the authoritative owner of the page and
>>>>>> +                * its mappings, and the pressure that results from having two
>>>>>> +                * in-memory copies outweighs any benefits of caching the
>>>>>> +                * compression work.
>>>>>> +                *
>>>>>> +                * (Swapins with swap count > 1 go through the swapcache.
>>>>>> +                * For swap count == 1, the swapcache is skipped and we
>>>>>> +                * remain the primary owner of the entry.)
>>>>>> +                */
>>>>>> +               if (swapcache)
>>>>>> +                       entry = xa_erase(tree, offset + i);
>>>>>> +               else
>>>>>> +                       entry = xa_load(tree, offset + i);
>>>>>>
>>>>>> -       count_vm_event(ZSWPIN);
>>>>>> -       if (entry->objcg)
>>>>>> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>>>> +               zswap_decompress(entry, folio_page(folio, i));
>>>>>>
>>>>>> -       if (swapcache) {
>>>>>> -               zswap_entry_free(entry);
>>>>>> -               folio_mark_dirty(folio);
>>>>>> +               if (entry->objcg)
>>>>>> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>>>> +               if (swapcache)
>>>>>> +                       zswap_entry_free(entry);
>>>>>>         }
>>>>>>
>>>>>> +       count_vm_events(ZSWPIN, nr_pages);
>>>>>> +       if (swapcache)
>>>>>> +               folio_mark_dirty(folio);
>>>>>> +
>>>>>>         folio_mark_uptodate(folio);
>>>>>>         return true;
>>>>>>  }
>>>>>> --
>>>>>> 2.43.5
>>>>>>
>>>>>
>>>
>>> Thanks
>>> barry
>>
Yosry Ahmed Oct. 21, 2024, 9:34 p.m. UTC | #7
On Mon, Oct 21, 2024 at 1:57 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 21/10/2024 21:28, Barry Song wrote:
> > On Tue, Oct 22, 2024 at 1:21 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 21/10/2024 11:55, Barry Song wrote:
> >>> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 21/10/2024 06:49, Barry Song wrote:
> >>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>
> >>>>>> At time of folio allocation, alloc_swap_folio checks if the entire
> >>>>>> folio is in zswap to determine folio order.
> >>>>>> During swap_read_folio, zswap_load will check if the entire folio
> >>>>>> is in zswap, and if it is, it will iterate through the pages in
> >>>>>> folio and decompress them.
> >>>>>> This will mean the benefits of large folios (fewer page faults, batched
> >>>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> >>>>>> and amd) are not lost at swap out when using zswap.
> >>>>>> This patch does not add support for hybrid backends (i.e. folios
> >>>>>> partly present swap and zswap).
> >>>>>>
> >>>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >>>>>> ---
> >>>>>>  mm/memory.c | 13 +++-------
> >>>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
> >>>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index 49d243131169..75f7b9f5fb32 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>>>>>
> >>>>>>         /*
> >>>>>>          * swap_read_folio() can't handle the case a large folio is hybridly
> >>>>>> -        * from different backends. And they are likely corner cases. Similar
> >>>>>> -        * things might be added once zswap support large folios.
> >>>>>> +        * from different backends. And they are likely corner cases.
> >>>>>>          */
> >>>>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>>>>>                 return false;
> >>>>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>>>>>                 return false;
> >>>>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> >>>>>> +               return false;

Hmm, if the entire folio is not in zswap, this will prevent the large
folio swapin, right?

Also, I think this is racy, see the comments below and in patch 1.

> >>>>>>
> >>>>>>         return true;
> >>>>>>  }
> >>>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>>>>>         if (unlikely(userfaultfd_armed(vma)))
> >>>>>>                 goto fallback;
> >>>>>>
> >>>>>> -       /*
> >>>>>> -        * A large swapped out folio could be partially or fully in zswap. We
> >>>>>> -        * lack handling for such cases, so fallback to swapping in order-0
> >>>>>> -        * folio.
> >>>>>> -        */
> >>>>>> -       if (!zswap_never_enabled())
> >>>>>> -               goto fallback;
> >>>>>> -
> >>>>>>         entry = pte_to_swp_entry(vmf->orig_pte);
> >>>>>>         /*
> >>>>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>>>>> index 9cc91ae31116..a5aa86c24060 100644
> >>>>>> --- a/mm/zswap.c
> >>>>>> +++ b/mm/zswap.c
> >>>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>>>>>
> >>>>>>  bool zswap_load(struct folio *folio)
> >>>>>>  {
> >>>>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>>>         swp_entry_t swp = folio->swap;
> >>>>>> +       unsigned int type = swp_type(swp);
> >>>>>>         pgoff_t offset = swp_offset(swp);
> >>>>>>         bool swapcache = folio_test_swapcache(folio);
> >>>>>> -       struct xarray *tree = swap_zswap_tree(swp);
> >>>>>> +       struct xarray *tree;
> >>>>>>         struct zswap_entry *entry;
> >>>>>> +       int i;
> >>>>>>
> >>>>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>>>>>
> >>>>>>         if (zswap_never_enabled())
> >>>>>>                 return false;
> >>>>>>
> >>>>>> -       /*
> >>>>>> -        * Large folios should not be swapped in while zswap is being used, as
> >>>>>> -        * they are not properly handled. Zswap does not properly load large
> >>>>>> -        * folios, and a large folio may only be partially in zswap.
> >>>>>> -        *
> >>>>>> -        * Return true without marking the folio uptodate so that an IO error is
> >>>>>> -        * emitted (e.g. do_swap_page() will sigbus).
> >>>>>> -        */
> >>>>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> >>>>>> -               return true;
> >>>>>> -
> >>>>>> -       /*
> >>>>>> -        * When reading into the swapcache, invalidate our entry. The
> >>>>>> -        * swapcache can be the authoritative owner of the page and
> >>>>>> -        * its mappings, and the pressure that results from having two
> >>>>>> -        * in-memory copies outweighs any benefits of caching the
> >>>>>> -        * compression work.
> >>>>>> -        *
> >>>>>> -        * (Most swapins go through the swapcache. The notable
> >>>>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >>>>>> -        * files, which reads into a private page and may free it if
> >>>>>> -        * the fault fails. We remain the primary owner of the entry.)
> >>>>>> -        */
> >>>>>> -       if (swapcache)
> >>>>>> -               entry = xa_erase(tree, offset);
> >>>>>> -       else
> >>>>>> -               entry = xa_load(tree, offset);
> >>>>>> -
> >>>>>> -       if (!entry)
> >>>>>> +       if (!zswap_present_test(folio->swap, nr_pages))
> >>>>>>                 return false;
> >>>>>
> >>>>> Hi Usama,
> >>>>>
> >>>>> Is there any chance that zswap_present_test() returns true
> >>>>> in do_swap_page() but false in zswap_load()? If that’s
> >>>>> possible, could we be missing something? For example,
> >>>>> could it be that zswap has been partially released (with
> >>>>> part of it still present) during an mTHP swap-in?

As I mentioned in patch 1, we need to document when the result of
zswap_present_test() is stable, i.e. when we cannot race with other
stores, exclusive loads, writeback, or invalidation.

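Something along these lines above zswap_present_test() would make that
contract explicit (the wording below is only a suggestion, not taken from
the patch):

/*
 * The result is only stable if the caller already holds references on
 * the swap slots (e.g. via swapcache_prepare()); otherwise it can race
 * with concurrent stores, exclusive loads, writeback and invalidation,
 * and must be treated as a hint only.
 */
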
> >>>>>
> >>>>> If this happens with an mTHP, my understanding is that
> >>>>> we shouldn't proceed with reading corrupted data from the
> >>>>> disk backend.
> >>>>>
> >>>>
> >>>> If it's not swapcache, the zswap entry is not deleted, so I think
> >>>> it should be ok?

Can we race with things like writeback and other exclusive loads
because swapcache_prepare() is not called yet?

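My (simplified) reading of the SWP_SYNCHRONOUS_IO path in do_swap_page(),
error handling elided, just to show where the window is:

        folio = alloc_swap_folio(vmf);      /* zswap_present_test() via can_swapin_thp() */
        ...
        swapcache_prepare(entry, nr_pages); /* the slots are only pinned from here on */
        ...
        swap_read_folio(folio, NULL);       /* -> zswap_load() -> zswap_present_test() */

Anything that writes back or exclusively loads one of those entries between
the first check and swapcache_prepare() can change the picture underneath us.
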
> >>>>
> >>>> We can check over here if the entire folio is in zswap,
> >>>> and if not, return true without marking the folio uptodate
> >>>> to give an error.
> >>>
> >>> We have swapcache_prepare() called in do_swap_page(), which should
> >>> have protected these entries from being partially freed by other processes
> >>> (for example, if someone falls back to small folios for the same address).
> >>> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> >>> the current case where only synchronous I/O is supported.
> >>>
> >>> the below might help detect the bug?
> >>>
> >>> if (!zswap_present_test(folio->swap, nr_pages)) {
> >>>         if (WARN_ON_ONCE(nr_pages > 1))
> >>>                 return true;
> >>>         return false;
> >>> }
> >>>
> >>
> >> I think this isn't correct. If nr_pages > 1 and the entire folio is not in zswap,
> >> it should still return false. So we would need to check the whole folio if we want to
> >> warn. But I think if we are sure the code is ok, it is an unnecessary check.
> >
> > My point is that zswap_present_test() can't differentiate between:
> > 1. the *whole* folio is not in zswap
> > 2. the folio is *partially* not in zswap
> >
> > In case 2, returning false is wrong.
> >
>
> Agreed!
>
> > And when nr_pages > 1, we have already confirmed earlier in
> > do_swap_page() that zswap_present_test() is true. At this point,
> > it must always be true; if it's false, it indicates a bug.
> >
>
> Yes, agreed! I was thinking just from the zswap_load() perspective,
> irrespective of who calls it.
> If someone adds large folio support to swapin_readahead, then I think the
> above warn might be an issue.
>
> But just with this patch series, doing what you suggested is correct. I
> will add it in the next revision. We can deal with it once swap count > 1
> starts supporting large folios.

I think I don't follow this part of the conversation properly, but it
seems like we want to catch the case where we end up in zswap_load()
and only part of the folio is in zswap. Can we use something like the
approach we used for swap_zeromap_batch()?
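
Just to illustrate, a rough sketch; zswap_present_batch() is a hypothetical
name modelled on swap_zeromap_batch(), returning how many of the leading
swap slots have a zswap entry:

static int zswap_present_batch(swp_entry_t entry, int nr)
{
        unsigned int type = swp_type(entry);
        pgoff_t offset = swp_offset(entry);
        int i;

        for (i = 0; i < nr; i++) {
                struct xarray *tree = swap_zswap_tree(swp_entry(type, offset + i));

                /* xa_load() takes the RCU read lock internally */
                if (!xa_load(tree, offset + i))
                        break;
        }
        return i;
}

can_swapin_thp() could then require the count to be 0 or nr_pages, and
zswap_load() could WARN_ON_ONCE(zswap_present_batch(swp, nr_pages) != nr_pages)
before touching any entry, so a partially present folio is caught instead of
silently falling through to the disk path.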

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 49d243131169..75f7b9f5fb32 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4077,13 +4077,14 @@  static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
 
 	/*
 	 * swap_read_folio() can't handle the case a large folio is hybridly
-	 * from different backends. And they are likely corner cases. Similar
-	 * things might be added once zswap support large folios.
+	 * from different backends. And they are likely corner cases.
 	 */
 	if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
 		return false;
 	if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
 		return false;
+	if (unlikely(!zswap_present_test(entry, nr_pages)))
+		return false;
 
 	return true;
 }
@@ -4130,14 +4131,6 @@  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	if (unlikely(userfaultfd_armed(vma)))
 		goto fallback;
 
-	/*
-	 * A large swapped out folio could be partially or fully in zswap. We
-	 * lack handling for such cases, so fallback to swapping in order-0
-	 * folio.
-	 */
-	if (!zswap_never_enabled())
-		goto fallback;
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	/*
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
diff --git a/mm/zswap.c b/mm/zswap.c
index 9cc91ae31116..a5aa86c24060 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1624,59 +1624,53 @@  bool zswap_present_test(swp_entry_t swp, int nr_pages)
 
 bool zswap_load(struct folio *folio)
 {
+	int nr_pages = folio_nr_pages(folio);
 	swp_entry_t swp = folio->swap;
+	unsigned int type = swp_type(swp);
 	pgoff_t offset = swp_offset(swp);
 	bool swapcache = folio_test_swapcache(folio);
-	struct xarray *tree = swap_zswap_tree(swp);
+	struct xarray *tree;
 	struct zswap_entry *entry;
+	int i;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 
 	if (zswap_never_enabled())
 		return false;
 
-	/*
-	 * Large folios should not be swapped in while zswap is being used, as
-	 * they are not properly handled. Zswap does not properly load large
-	 * folios, and a large folio may only be partially in zswap.
-	 *
-	 * Return true without marking the folio uptodate so that an IO error is
-	 * emitted (e.g. do_swap_page() will sigbus).
-	 */
-	if (WARN_ON_ONCE(folio_test_large(folio)))
-		return true;
-
-	/*
-	 * When reading into the swapcache, invalidate our entry. The
-	 * swapcache can be the authoritative owner of the page and
-	 * its mappings, and the pressure that results from having two
-	 * in-memory copies outweighs any benefits of caching the
-	 * compression work.
-	 *
-	 * (Most swapins go through the swapcache. The notable
-	 * exception is the singleton fault on SWP_SYNCHRONOUS_IO
-	 * files, which reads into a private page and may free it if
-	 * the fault fails. We remain the primary owner of the entry.)
-	 */
-	if (swapcache)
-		entry = xa_erase(tree, offset);
-	else
-		entry = xa_load(tree, offset);
-
-	if (!entry)
+	if (!zswap_present_test(folio->swap, nr_pages))
 		return false;
 
-	zswap_decompress(entry, &folio->page);
+	for (i = 0; i < nr_pages; ++i) {
+		tree = swap_zswap_tree(swp_entry(type, offset + i));
+		/*
+		 * When reading into the swapcache, invalidate our entry. The
+		 * swapcache can be the authoritative owner of the page and
+		 * its mappings, and the pressure that results from having two
+		 * in-memory copies outweighs any benefits of caching the
+		 * compression work.
+		 *
+		 * (Swapins with swap count > 1 go through the swapcache.
+		 * For swap count == 1, the swapcache is skipped and we
+		 * remain the primary owner of the entry.)
+		 */
+		if (swapcache)
+			entry = xa_erase(tree, offset + i);
+		else
+			entry = xa_load(tree, offset + i);
 
-	count_vm_event(ZSWPIN);
-	if (entry->objcg)
-		count_objcg_events(entry->objcg, ZSWPIN, 1);
+		zswap_decompress(entry, folio_page(folio, i));
 
-	if (swapcache) {
-		zswap_entry_free(entry);
-		folio_mark_dirty(folio);
+		if (entry->objcg)
+			count_objcg_events(entry->objcg, ZSWPIN, 1);
+		if (swapcache)
+			zswap_entry_free(entry);
 	}
 
+	count_vm_events(ZSWPIN, nr_pages);
+	if (swapcache)
+		folio_mark_dirty(folio);
+
 	folio_mark_uptodate(folio);
 	return true;
 }