Message ID: 20230512235755.1589034-2-pcc@google.com (mailing list archive)
State: New
Series: mm: Fix bug affecting swapping in MTE tagged pages
On 13.05.23 01:57, Peter Collingbourne wrote: > Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved > the call to swap_free() before the call to set_pte_at(), which meant that > the MTE tags could end up being freed before set_pte_at() had a chance > to restore them. One other possibility was to hook arch_do_swap_page(), > but this had a number of problems: > > - The call to the hook was also after swap_free(). > > - The call to the hook was after the call to set_pte_at(), so there was a > racy window where uninitialized metadata may be exposed to userspace. > This likely also affects SPARC ADI, which implements this hook to > restore tags. > > - As a result of commit 1eba86c096e3 ("mm: change page type prior to > adding page table entry"), we were also passing the new PTE as the > oldpte argument, preventing the hook from knowing the swap index. > > Fix all of these problems by moving the arch_do_swap_page() call before > the call to free_page(), and ensuring that we do not set orig_pte until > after the call. > > Signed-off-by: Peter Collingbourne <pcc@google.com> > Suggested-by: Catalin Marinas <catalin.marinas@arm.com> > Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965 > Cc: <stable@vger.kernel.org> # 6.1 > Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap") > Fixes: 1eba86c096e3 ("mm: change page type prior to adding page table entry") I'm confused. You say c145e0b47c77 changed something (which was after above commits), indicate that it fixes two other commits, and indicate "6.1" as stable which does not apply to any of these commits. > --- > mm/memory.c | 26 +++++++++++++------------- > 1 file changed, 13 insertions(+), 13 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 01a23ad48a04..83268d287ff1 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > } > } > > - /* > - * Remove the swap entry and conditionally try to free up the swapcache. > - * We're already holding a reference on the page but haven't mapped it > - * yet. > - */ > - swap_free(entry); > - if (should_try_to_free_swap(folio, vma, vmf->flags)) > - folio_free_swap(folio); > - > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > pte = mk_pte(page, vma->vm_page_prot); > - > /* > * Same logic as in do_wp_page(); however, optimize for pages that are > * certainly not shared either because we just allocated them without > @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > pte = pte_mksoft_dirty(pte); > if (pte_swp_uffd_wp(vmf->orig_pte)) > pte = pte_mkuffd_wp(pte); > + arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > vmf->orig_pte = pte; > > + /* > + * Remove the swap entry and conditionally try to free up the swapcache. > + * We're already holding a reference on the page but haven't mapped it > + * yet. 
> + */ > + swap_free(entry); > + if (should_try_to_free_swap(folio, vma, vmf->flags)) > + folio_free_swap(folio); > + > + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > + > /* ksm created a completely new copy */ > if (unlikely(folio != swapcache && swapcache)) { > page_add_new_anon_rmap(page, vma, vmf->address); > @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > VM_BUG_ON(!folio_test_anon(folio) || > (pte_write(pte) && !PageAnonExclusive(page))); > set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > folio_unlock(folio); > if (folio != swapcache && swapcache) { You are moving the folio_free_swap() call after the folio_ref_count(folio) == 1 check, which means that such (previously) swapped pages that are exclusive cannot be detected as exclusive. There must be a better way to handle MTE here. Where are the tags stored, how is the location identified, and when are they effectively restored right now?
On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: > On 13.05.23 01:57, Peter Collingbourne wrote: > > diff --git a/mm/memory.c b/mm/memory.c > > index 01a23ad48a04..83268d287ff1 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > } > > } > > - /* > > - * Remove the swap entry and conditionally try to free up the swapcache. > > - * We're already holding a reference on the page but haven't mapped it > > - * yet. > > - */ > > - swap_free(entry); > > - if (should_try_to_free_swap(folio, vma, vmf->flags)) > > - folio_free_swap(folio); > > - > > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > > pte = mk_pte(page, vma->vm_page_prot); > > - > > /* > > * Same logic as in do_wp_page(); however, optimize for pages that are > > * certainly not shared either because we just allocated them without > > @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > pte = pte_mksoft_dirty(pte); > > if (pte_swp_uffd_wp(vmf->orig_pte)) > > pte = pte_mkuffd_wp(pte); > > + arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > vmf->orig_pte = pte; > > + /* > > + * Remove the swap entry and conditionally try to free up the swapcache. > > + * We're already holding a reference on the page but haven't mapped it > > + * yet. > > + */ > > + swap_free(entry); > > + if (should_try_to_free_swap(folio, vma, vmf->flags)) > > + folio_free_swap(folio); > > + > > + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > > + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > > + > > /* ksm created a completely new copy */ > > if (unlikely(folio != swapcache && swapcache)) { > > page_add_new_anon_rmap(page, vma, vmf->address); > > @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > VM_BUG_ON(!folio_test_anon(folio) || > > (pte_write(pte) && !PageAnonExclusive(page))); > > set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > folio_unlock(folio); > > if (folio != swapcache && swapcache) { > > > You are moving the folio_free_swap() call after the folio_ref_count(folio) > == 1 check, which means that such (previously) swapped pages that are > exclusive cannot be detected as exclusive. > > There must be a better way to handle MTE here. > > Where are the tags stored, how is the location identified, and when are they > effectively restored right now? I haven't gone through Peter's patches yet but a pretty good description of the problem is here: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/. I couldn't reproduce it with my swap setup but both Qun-wei and Peter triggered it. When a tagged page is swapped out, the arm64 code stores the metadata (tags) in a local xarray indexed by the swap pte. When restoring from swap, the arm64 set_pte_at() checks this xarray using the old swap pte and spills the tags onto the new page. Apparently something changed in the kernel recently that causes swap_range_free() to be called before set_pte_at(). The arm64 arch_swap_invalidate_page() frees the metadata from the xarray and the subsequent set_pte_at() won't find it. If we have the page, the metadata can be restored before set_pte_at() and I guess that's what Peter is trying to do (again, I haven't looked at the details yet; leaving it for tomorrow). Is there any other way of handling this? E.g. 
not release the metadata in arch_swap_invalidate_page() but later in set_pte_at() once it was restored. But then we may leak this metadata if there's no set_pte_at() (the process mapping the swap entry died).
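To make that lifecycle concrete, here is a simplified sketch of the arm64 side, modeled on arch/arm64/mm/mteswap.c (error handling and the racing-store case are elided):

/* Simplified sketch of the arm64 tag metadata lifecycle, modeled on
 * arch/arm64/mm/mteswap.c; error handling and racing stores elided. */
static DEFINE_XARRAY(mte_pages);

/* Swap-out: save the page's tags, keyed by the swap entry value. */
int mte_save_tags(struct page *page)
{
	void *tag_storage = mte_allocate_tag_storage();

	if (!tag_storage)
		return -ENOMEM;
	mte_save_page_tags(page_address(page), tag_storage);
	/* page_private() contains the swap entry.val at this point */
	xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL);
	return 0;
}

/* Swap-in: set_pte_at() looks the tags up again via the old swap pte. */
bool mte_restore_tags(swp_entry_t entry, struct page *page)
{
	void *tags = xa_load(&mte_pages, entry.val);

	if (!tags)
		return false;	/* never saved, or already invalidated */
	mte_restore_page_tags(page_address(page), tags);
	return true;
}

/* Reached via arch_swap_invalidate_page() from swap_range_free(): once
 * the swap slot dies, so do the saved tags -- hence the ordering problem
 * when the slot is freed before set_pte_at() runs. */
void mte_invalidate_tags(int type, pgoff_t offset)
{
	void *tags = xa_erase(&mte_pages, swp_entry(type, offset).val);

	mte_free_tag_storage(tags);
}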
On Mon, May 15, 2023 at 06:34:30PM +0100, Catalin Marinas wrote: > On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: > > On 13.05.23 01:57, Peter Collingbourne wrote: > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 01a23ad48a04..83268d287ff1 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > } > > > } > > > - /* > > > - * Remove the swap entry and conditionally try to free up the swapcache. > > > - * We're already holding a reference on the page but haven't mapped it > > > - * yet. > > > - */ > > > - swap_free(entry); > > > - if (should_try_to_free_swap(folio, vma, vmf->flags)) > > > - folio_free_swap(folio); > > > - > > > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > > > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > > > pte = mk_pte(page, vma->vm_page_prot); > > > - > > > /* > > > * Same logic as in do_wp_page(); however, optimize for pages that are > > > * certainly not shared either because we just allocated them without > > > @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > pte = pte_mksoft_dirty(pte); > > > if (pte_swp_uffd_wp(vmf->orig_pte)) > > > pte = pte_mkuffd_wp(pte); > > > + arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > > vmf->orig_pte = pte; > > > + /* > > > + * Remove the swap entry and conditionally try to free up the swapcache. > > > + * We're already holding a reference on the page but haven't mapped it > > > + * yet. > > > + */ > > > + swap_free(entry); > > > + if (should_try_to_free_swap(folio, vma, vmf->flags)) > > > + folio_free_swap(folio); > > > + > > > + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > > > + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > > > + > > > /* ksm created a completely new copy */ > > > if (unlikely(folio != swapcache && swapcache)) { > > > page_add_new_anon_rmap(page, vma, vmf->address); > > > @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > VM_BUG_ON(!folio_test_anon(folio) || > > > (pte_write(pte) && !PageAnonExclusive(page))); > > > set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > > > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > > folio_unlock(folio); > > > if (folio != swapcache && swapcache) { > > > > > > You are moving the folio_free_swap() call after the folio_ref_count(folio) > > == 1 check, which means that such (previously) swapped pages that are > > exclusive cannot be detected as exclusive. > > > > There must be a better way to handle MTE here. > > > > Where are the tags stored, how is the location identified, and when are they > > effectively restored right now? > > I haven't gone through Peter's patches yet but a pretty good description > of the problem is here: > https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/. > I couldn't reproduce it with my swap setup but both Qun-wei and Peter > triggered it. In order to reproduce this bug it is necessary for the swap slot cache to be disabled, which is unlikely to occur during normal operation. 
I was only able to reproduce the bug by disabling it forcefully with the
following patch:

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e0..25afba16980c7 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -79,7 +79,7 @@ void disable_swap_slots_cache_lock(void)
 
 static void __reenable_swap_slots_cache(void)
 {
-	swap_slot_cache_enabled = has_usable_swap();
+	swap_slot_cache_enabled = false;
 }
 
 void reenable_swap_slots_cache_unlock(void)

With that I can trigger the bug on an MTE-utilizing process by running
a program that enumerates the process's private anonymous mappings and
calls process_madvise(MADV_PAGEOUT) on all of them.

> When a tagged page is swapped out, the arm64 code stores the metadata
> (tags) in a local xarray indexed by the swap pte. When restoring from
> swap, the arm64 set_pte_at() checks this xarray using the old swap pte
> and spills the tags onto the new page. Apparently something changed in
> the kernel recently that causes swap_range_free() to be called before
> set_pte_at(). The arm64 arch_swap_invalidate_page() frees the metadata
> from the xarray and the subsequent set_pte_at() won't find it.
>
> If we have the page, the metadata can be restored before set_pte_at()
> and I guess that's what Peter is trying to do (again, I haven't looked
> at the details yet; leaving it for tomorrow).
>
> Is there any other way of handling this? E.g. not release the metadata
> in arch_swap_invalidate_page() but later in set_pte_at() once it was
> restored. But then we may leak this metadata if there's no set_pte_at()
> (the process mapping the swap entry died).

Another problem that I can see with this approach is that it does not
respect reference counts for swap entries, and it's unclear whether that
can be done in a non-racy fashion.

Another approach that I considered was to move the hook to swap_readpage()
as in the patch below (sorry, it only applies to an older version
of Android's android14-6.1 branch and not mainline, but you get the
idea). But during a stress test (running the aforementioned program that
calls process_madvise(MADV_PAGEOUT) in a loop during an Android "monkey"
test) I discovered the following racy use-after-free that can occur when
two tasks T1 and T2 concurrently restore the same page:

T1:                  | T2:
arch_swap_readpage() |
                     | arch_swap_readpage() -> mte_restore_tags() -> xa_load()
swap_free()          |
                     | arch_swap_readpage() -> mte_restore_tags() -> mte_restore_page_tags()

We can avoid it by taking the swap_info_struct::lock spinlock in
mte_restore_tags(), but it seems like it would lead to lock contention.
Peter

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 3f8199ba265a1..99c8be073f107 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -25,7 +25,7 @@ unsigned long mte_copy_tags_to_user(void __user *to, void *from,
 			      unsigned long n);
 int mte_save_tags(struct page *page);
 void mte_save_page_tags(const void *page_addr, void *tag_storage);
-bool mte_restore_tags(swp_entry_t entry, struct page *page);
+void mte_restore_tags(struct page *page);
 void mte_restore_page_tags(void *page_addr, const void *tag_storage);
 void mte_invalidate_tags(int type, pgoff_t offset);
 void mte_invalidate_tags_area(int type);
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 812373cff4eec..32d3c661a0eee 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1054,11 +1054,11 @@ static inline void arch_swap_invalidate_area(int type)
 		mte_invalidate_tags_area(type);
 }
 
-#define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+#define __HAVE_ARCH_SWAP_READPAGE
+static inline void arch_swap_readpage(struct page *page)
 {
-	if (system_supports_mte() && mte_restore_tags(entry, &folio->page))
-		set_page_mte_tagged(&folio->page);
+	if (system_supports_mte())
+		mte_restore_tags(page);
 }
 
 #endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 84a085d536f84..176f094ecaa1e 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -38,15 +38,6 @@ EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
 
 static void mte_sync_page_tags(struct page *page, pte_t old_pte,
 			       bool check_swap, bool pte_is_tagged)
 {
-	if (check_swap && is_swap_pte(old_pte)) {
-		swp_entry_t entry = pte_to_swp_entry(old_pte);
-
-		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
-			set_page_mte_tagged(page);
-			return;
-		}
-	}
-
 	if (!pte_is_tagged)
 		return;
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 70f913205db99..3fe7774f32b3c 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -46,21 +46,23 @@ int mte_save_tags(struct page *page)
 	return 0;
 }
 
-bool mte_restore_tags(swp_entry_t entry, struct page *page)
+void mte_restore_tags(struct page *page)
 {
+	swp_entry_t entry = folio_swap_entry(page_folio(page));
 	void *tags = xa_load(&mte_pages, entry.val);
 
 	if (!tags)
-		return false;
+		return;
 
 	/*
 	 * Test PG_mte_tagged again in case it was racing with another
	 * set_pte_at().
 	 */
-	if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+	if (!test_and_set_bit(PG_mte_tagged, &page->flags)) {
 		mte_restore_page_tags(page_address(page), tags);
-
-	return true;
+		if (kasan_hw_tags_enabled())
+			page_kasan_tag_reset(page);
+	}
 }
 
 void mte_invalidate_tags(int type, pgoff_t offset)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5f0d7d0b9471b..eea1e545595ca 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -793,8 +793,8 @@ static inline void arch_swap_invalidate_area(int type)
 }
 #endif
 
-#ifndef __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+#ifndef __HAVE_ARCH_SWAP_READPAGE
+static inline void arch_swap_readpage(struct page *page)
 {
 }
 #endif
diff --git a/mm/page_io.c b/mm/page_io.c
index 3a5f921b932e8..a2f53dbeca7b3 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -470,6 +470,12 @@ int swap_readpage(struct page *page, bool synchronous,
 	}
 	delayacct_swapin_start();
 
+	/*
+	 * Some architectures may have to restore extra metadata to the
+	 * page when reading from swap.
+	 */
+	arch_swap_readpage(page);
+
 	if (frontswap_load(page) == 0) {
 		SetPageUptodate(page);
 		unlock_page(page);
diff --git a/mm/shmem.c b/mm/shmem.c
index 0b335607bf2ad..82ccf1e6efe5d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1784,12 +1784,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 	folio_wait_writeback(folio);
 
-	/*
-	 * Some architectures may have to restore extra metadata to the
-	 * folio after reading from swap.
-	 */
-	arch_swap_restore(swap, folio);
-
 	if (shmem_should_replace_folio(folio, gfp)) {
 		error = shmem_replace_folio(&folio, gfp, info, index);
 		if (error)
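A rough user-space sketch of the pageout program described above (hypothetical; the enumeration and error handling here are illustrative, not the actual tool, and paging out another process's mappings requires CAP_SYS_NICE):

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

int main(int argc, char **argv)
{
	pid_t pid;
	int pidfd;
	char path[64], line[512];
	FILE *maps;

	if (argc < 2)
		return 1;
	pid = atoi(argv[1]);
	pidfd = syscall(SYS_pidfd_open, pid, 0);
	snprintf(path, sizeof(path), "/proc/%d/maps", pid);
	maps = fopen(path, "r");
	if (pidfd < 0 || !maps)
		return 1;

	while (fgets(line, sizeof(line), maps)) {
		unsigned long start, end, inode;
		char perms[8];
		struct iovec iov;

		if (sscanf(line, "%lx-%lx %7s %*s %*s %lu",
			   &start, &end, perms, &inode) != 4)
			continue;
		/* private ('p') anonymous (inode 0, no backing file) only */
		if (perms[3] != 'p' || inode != 0)
			continue;
		iov.iov_base = (void *)start;
		iov.iov_len = end - start;
		syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_PAGEOUT, 0);
	}
	fclose(maps);
	return 0;
}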
On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: > On 13.05.23 01:57, Peter Collingbourne wrote: > > Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved > > the call to swap_free() before the call to set_pte_at(), which meant that > > the MTE tags could end up being freed before set_pte_at() had a chance > > to restore them. One other possibility was to hook arch_do_swap_page(), > > but this had a number of problems: > > > > - The call to the hook was also after swap_free(). > > > > - The call to the hook was after the call to set_pte_at(), so there was a > > racy window where uninitialized metadata may be exposed to userspace. > > This likely also affects SPARC ADI, which implements this hook to > > restore tags. > > > > - As a result of commit 1eba86c096e3 ("mm: change page type prior to > > adding page table entry"), we were also passing the new PTE as the > > oldpte argument, preventing the hook from knowing the swap index. > > > > Fix all of these problems by moving the arch_do_swap_page() call before > > the call to free_page(), and ensuring that we do not set orig_pte until > > after the call. > > > > Signed-off-by: Peter Collingbourne <pcc@google.com> > > Suggested-by: Catalin Marinas <catalin.marinas@arm.com> > > Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965 > > Cc: <stable@vger.kernel.org> # 6.1 > > Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap") > > Fixes: 1eba86c096e3 ("mm: change page type prior to adding page table entry") > > I'm confused. You say c145e0b47c77 changed something (which was after above > commits), indicate that it fixes two other commits, and indicate "6.1" as > stable which does not apply to any of these commits. Sorry, the situation is indeed a bit confusing. - In order to make the arch_do_swap_page() hook suitable for fixing the bug introduced by c145e0b47c77, patch 1 addresses a number of issues, including fixing bugs introduced by ca827d55ebaa and 1eba86c096e3, but we haven't fixed the c145e0b47c77 bug yet, so there's no Fixes: tag for it yet. - Patch 2, relying on the fixes in patch 1, makes MTE install an arch_do_swap_page() hook (indirectly, by making arch_swap_restore() also hook arch_do_swap_page()), thereby fixing the c145e0b47c77 bug. - 6.1 is the first stable version in which all 3 commits in my Fixes: tags are present, so that is the version that I've indicated in my stable tag for this series. In theory patch 1 could be applied to older kernel versions, but it wouldn't fix any problems that we are facing with MTE (because it only fixes problems relating to the arch_do_swap_page() hook, which older kernel versions don't hook with MTE), and there are some merge conflicts if we go back further anyway. If the SPARC folks (the previous only user of this hook) want to fix these issues with ADI, they can propose their own backport. > > @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > VM_BUG_ON(!folio_test_anon(folio) || > > (pte_write(pte) && !PageAnonExclusive(page))); > > set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > folio_unlock(folio); > > if (folio != swapcache && swapcache) { > > > You are moving the folio_free_swap() call after the folio_ref_count(folio) > == 1 check, which means that such (previously) swapped pages that are > exclusive cannot be detected as exclusive. Ack. I will fix this in v2. Peter
On Mon, May 15, 2023 at 05:16:09PM -0700, Peter Collingbourne wrote: > On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: > > On 13.05.23 01:57, Peter Collingbourne wrote: > > > Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved > > > the call to swap_free() before the call to set_pte_at(), which meant that > > > the MTE tags could end up being freed before set_pte_at() had a chance > > > to restore them. One other possibility was to hook arch_do_swap_page(), > > > but this had a number of problems: > > > > > > - The call to the hook was also after swap_free(). > > > > > > - The call to the hook was after the call to set_pte_at(), so there was a > > > racy window where uninitialized metadata may be exposed to userspace. > > > This likely also affects SPARC ADI, which implements this hook to > > > restore tags. > > > > > > - As a result of commit 1eba86c096e3 ("mm: change page type prior to > > > adding page table entry"), we were also passing the new PTE as the > > > oldpte argument, preventing the hook from knowing the swap index. > > > > > > Fix all of these problems by moving the arch_do_swap_page() call before > > > the call to free_page(), and ensuring that we do not set orig_pte until > > > after the call. > > > > > > Signed-off-by: Peter Collingbourne <pcc@google.com> > > > Suggested-by: Catalin Marinas <catalin.marinas@arm.com> > > > Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965 > > > Cc: <stable@vger.kernel.org> # 6.1 > > > Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap") > > > Fixes: 1eba86c096e3 ("mm: change page type prior to adding page table entry") > > > > I'm confused. You say c145e0b47c77 changed something (which was after above > > commits), indicate that it fixes two other commits, and indicate "6.1" as > > stable which does not apply to any of these commits. > > Sorry, the situation is indeed a bit confusing. > > - In order to make the arch_do_swap_page() hook suitable for fixing the > bug introduced by c145e0b47c77, patch 1 addresses a number of issues, > including fixing bugs introduced by ca827d55ebaa and 1eba86c096e3, > but we haven't fixed the c145e0b47c77 bug yet, so there's no Fixes: > tag for it yet. > > - Patch 2, relying on the fixes in patch 1, makes MTE install an > arch_do_swap_page() hook (indirectly, by making arch_swap_restore() > also hook arch_do_swap_page()), thereby fixing the c145e0b47c77 bug. > > - 6.1 is the first stable version in which all 3 commits in my Fixes: tags > are present, so that is the version that I've indicated in my stable > tag for this series. In theory patch 1 could be applied to older kernel > versions, but it wouldn't fix any problems that we are facing with MTE > (because it only fixes problems relating to the arch_do_swap_page() > hook, which older kernel versions don't hook with MTE), and there are > some merge conflicts if we go back further anyway. If the SPARC folks > (the previous only user of this hook) want to fix these issues with ADI, > they can propose their own backport. 
> > > > @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > VM_BUG_ON(!folio_test_anon(folio) || > > > (pte_write(pte) && !PageAnonExclusive(page))); > > > set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > > > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > > folio_unlock(folio); > > > if (folio != swapcache && swapcache) { > > > > > > You are moving the folio_free_swap() call after the folio_ref_count(folio) > > == 1 check, which means that such (previously) swapped pages that are > > exclusive cannot be detected as exclusive. > > Ack. I will fix this in v2. I gave this some thought and concluded that the added complexity needed to make this hook suitable for arm64 without breaking sparc probably isn't worth it in the end, and as I explained in patch 2, sparc ought to be moving away from this hook anyway. So in v2 I replaced patches 1 and 2 with a patch that adds a direct call to the arch_swap_restore() hook before the call to swap_free(). Peter
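Concretely, the v2 shape amounts to something like this in do_swap_page(), before the swap entry is released (illustrative sketch; the exact hunk is in the v2 posting):

	/*
	 * Some architectures may have to restore extra metadata to the page
	 * when reading from swap. This metadata may be indexed by swap entry
	 * so this must be called before swap_free().
	 */
	arch_swap_restore(entry, folio);

	/*
	 * Remove the swap entry and conditionally try to free up the
	 * swapcache. We're already holding a reference on the page but
	 * haven't mapped it yet.
	 */
	swap_free(entry);
	if (should_try_to_free_swap(folio, vma, vmf->flags))
		folio_free_swap(folio);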
On 15.05.23 19:34, Catalin Marinas wrote: > On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: >> On 13.05.23 01:57, Peter Collingbourne wrote: >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 01a23ad48a04..83268d287ff1 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>> } >>> } >>> - /* >>> - * Remove the swap entry and conditionally try to free up the swapcache. >>> - * We're already holding a reference on the page but haven't mapped it >>> - * yet. >>> - */ >>> - swap_free(entry); >>> - if (should_try_to_free_swap(folio, vma, vmf->flags)) >>> - folio_free_swap(folio); >>> - >>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); >>> pte = mk_pte(page, vma->vm_page_prot); >>> - >>> /* >>> * Same logic as in do_wp_page(); however, optimize for pages that are >>> * certainly not shared either because we just allocated them without >>> @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>> pte = pte_mksoft_dirty(pte); >>> if (pte_swp_uffd_wp(vmf->orig_pte)) >>> pte = pte_mkuffd_wp(pte); >>> + arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); >>> vmf->orig_pte = pte; >>> + /* >>> + * Remove the swap entry and conditionally try to free up the swapcache. >>> + * We're already holding a reference on the page but haven't mapped it >>> + * yet. >>> + */ >>> + swap_free(entry); >>> + if (should_try_to_free_swap(folio, vma, vmf->flags)) >>> + folio_free_swap(folio); >>> + >>> + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >>> + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); >>> + >>> /* ksm created a completely new copy */ >>> if (unlikely(folio != swapcache && swapcache)) { >>> page_add_new_anon_rmap(page, vma, vmf->address); >>> @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>> VM_BUG_ON(!folio_test_anon(folio) || >>> (pte_write(pte) && !PageAnonExclusive(page))); >>> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); >>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); >>> folio_unlock(folio); >>> if (folio != swapcache && swapcache) { >> >> >> You are moving the folio_free_swap() call after the folio_ref_count(folio) >> == 1 check, which means that such (previously) swapped pages that are >> exclusive cannot be detected as exclusive. >> >> There must be a better way to handle MTE here. >> >> Where are the tags stored, how is the location identified, and when are they >> effectively restored right now? > > I haven't gone through Peter's patches yet but a pretty good description > of the problem is here: > https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/. > I couldn't reproduce it with my swap setup but both Qun-wei and Peter > triggered it. > > When a tagged page is swapped out, the arm64 code stores the metadata > (tags) in a local xarray indexed by the swap pte. When restoring from > swap, the arm64 set_pte_at() checks this xarray using the old swap pte > and spills the tags onto the new page. Apparently something changed in > the kernel recently that causes swap_range_free() to be called before > set_pte_at(). The arm64 arch_swap_invalidate_page() frees the metadata > from the xarray and the subsequent set_pte_at() won't find it. > > If we have the page, the metadata can be restored before set_pte_at() > and I guess that's what Peter is trying to do (again, I haven't looked > at the details yet; leaving it for tomorrow). Thanks for the details! 
I was missing that we also have a hook in swap_range_free(). > > Is there any other way of handling this? E.g. not release the metadata > in arch_swap_invalidate_page() but later in set_pte_at() once it was > restored. But then we may leak this metadata if there's no set_pte_at() > (the process mapping the swap entry died). That was my immediate thought: do we really have to hook into swap_range_free() at all? And I also wondered why we have to do this from set_pte_at() and not do this explicitly (maybe that's the other arch_* callback on the swapin path). I'll have a look at v2, maybe it can be fixed easily without having to shuffle around too much of the swapin code (which can easily break again because the dependencies are not obvious at all and even undocumented in the code).
On 16.05.23 01:40, Peter Collingbourne wrote: > On Mon, May 15, 2023 at 06:34:30PM +0100, Catalin Marinas wrote: >> On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: >>> On 13.05.23 01:57, Peter Collingbourne wrote: >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 01a23ad48a04..83268d287ff1 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> } >>>> } >>>> - /* >>>> - * Remove the swap entry and conditionally try to free up the swapcache. >>>> - * We're already holding a reference on the page but haven't mapped it >>>> - * yet. >>>> - */ >>>> - swap_free(entry); >>>> - if (should_try_to_free_swap(folio, vma, vmf->flags)) >>>> - folio_free_swap(folio); >>>> - >>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); >>>> pte = mk_pte(page, vma->vm_page_prot); >>>> - >>>> /* >>>> * Same logic as in do_wp_page(); however, optimize for pages that are >>>> * certainly not shared either because we just allocated them without >>>> @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> pte = pte_mksoft_dirty(pte); >>>> if (pte_swp_uffd_wp(vmf->orig_pte)) >>>> pte = pte_mkuffd_wp(pte); >>>> + arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); >>>> vmf->orig_pte = pte; >>>> + /* >>>> + * Remove the swap entry and conditionally try to free up the swapcache. >>>> + * We're already holding a reference on the page but haven't mapped it >>>> + * yet. >>>> + */ >>>> + swap_free(entry); >>>> + if (should_try_to_free_swap(folio, vma, vmf->flags)) >>>> + folio_free_swap(folio); >>>> + >>>> + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >>>> + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); >>>> + >>>> /* ksm created a completely new copy */ >>>> if (unlikely(folio != swapcache && swapcache)) { >>>> page_add_new_anon_rmap(page, vma, vmf->address); >>>> @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> VM_BUG_ON(!folio_test_anon(folio) || >>>> (pte_write(pte) && !PageAnonExclusive(page))); >>>> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); >>>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); >>>> folio_unlock(folio); >>>> if (folio != swapcache && swapcache) { >>> >>> >>> You are moving the folio_free_swap() call after the folio_ref_count(folio) >>> == 1 check, which means that such (previously) swapped pages that are >>> exclusive cannot be detected as exclusive. >>> >>> There must be a better way to handle MTE here. >>> >>> Where are the tags stored, how is the location identified, and when are they >>> effectively restored right now? >> >> I haven't gone through Peter's patches yet but a pretty good description >> of the problem is here: >> https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/. >> I couldn't reproduce it with my swap setup but both Qun-wei and Peter >> triggered it. > > In order to reproduce this bug it is necessary for the swap slot cache > to be disabled, which is unlikely to occur during normal operation. 
I
> was only able to reproduce the bug by disabling it forcefully with the
> following patch:
>
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e0..25afba16980c7 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -79,7 +79,7 @@ void disable_swap_slots_cache_lock(void)
>
>  static void __reenable_swap_slots_cache(void)
>  {
> -	swap_slot_cache_enabled = has_usable_swap();
> +	swap_slot_cache_enabled = false;
>  }
>
>  void reenable_swap_slots_cache_unlock(void)
>
> With that I can trigger the bug on an MTE-utilizing process by running
> a program that enumerates the process's private anonymous mappings and
> calls process_madvise(MADV_PAGEOUT) on all of them.
>
>> When a tagged page is swapped out, the arm64 code stores the metadata
>> (tags) in a local xarray indexed by the swap pte. When restoring from
>> swap, the arm64 set_pte_at() checks this xarray using the old swap pte
>> and spills the tags onto the new page. Apparently something changed in
>> the kernel recently that causes swap_range_free() to be called before
>> set_pte_at(). The arm64 arch_swap_invalidate_page() frees the metadata
>> from the xarray and the subsequent set_pte_at() won't find it.
>>
>> If we have the page, the metadata can be restored before set_pte_at()
>> and I guess that's what Peter is trying to do (again, I haven't looked
>> at the details yet; leaving it for tomorrow).
>>
>> Is there any other way of handling this? E.g. not release the metadata
>> in arch_swap_invalidate_page() but later in set_pte_at() once it was
>> restored. But then we may leak this metadata if there's no set_pte_at()
>> (the process mapping the swap entry died).
>
> Another problem that I can see with this approach is that it does not
> respect reference counts for swap entries, and it's unclear whether that
> can be done in a non-racy fashion.
>
> Another approach that I considered was to move the hook to swap_readpage()
> as in the patch below (sorry, it only applies to an older version
> of Android's android14-6.1 branch and not mainline, but you get the
> idea). But during a stress test (running the aforementioned program that
> calls process_madvise(MADV_PAGEOUT) in a loop during an Android "monkey"
> test) I discovered the following racy use-after-free that can occur when
> two tasks T1 and T2 concurrently restore the same page:
>
> T1:                  | T2:
> arch_swap_readpage() |
>                      | arch_swap_readpage() -> mte_restore_tags() -> xa_load()
> swap_free()          |
>                      | arch_swap_readpage() -> mte_restore_tags() -> mte_restore_page_tags()
>
> We can avoid it by taking the swap_info_struct::lock spinlock in
> mte_restore_tags(), but it seems like it would lead to lock contention.
>

Would the idea be to fail swap_readpage() on the one that comes last,
simply retrying to lookup the page?

This might be a naive question, but how does MTE play along with shared
anonymous pages?
On 16.05.23 02:16, Peter Collingbourne wrote:
> On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote:
>> On 13.05.23 01:57, Peter Collingbourne wrote:
>>> Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
>>> the call to swap_free() before the call to set_pte_at(), which meant that
>>> the MTE tags could end up being freed before set_pte_at() had a chance
>>> to restore them. One other possibility was to hook arch_do_swap_page(),
>>> but this had a number of problems:
>>>
>>> - The call to the hook was also after swap_free().
>>>
>>> - The call to the hook was after the call to set_pte_at(), so there was a
>>>   racy window where uninitialized metadata may be exposed to userspace.
>>>   This likely also affects SPARC ADI, which implements this hook to
>>>   restore tags.
>>>
>>> - As a result of commit 1eba86c096e3 ("mm: change page type prior to
>>>   adding page table entry"), we were also passing the new PTE as the
>>>   oldpte argument, preventing the hook from knowing the swap index.
>>>
>>> Fix all of these problems by moving the arch_do_swap_page() call before
>>> the call to free_page(), and ensuring that we do not set orig_pte until
>>> after the call.
>>>
>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
>>> Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965
>>> Cc: <stable@vger.kernel.org> # 6.1
>>> Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
>>> Fixes: 1eba86c096e3 ("mm: change page type prior to adding page table entry")
>>
>> I'm confused. You say c145e0b47c77 changed something (which was after above
>> commits), indicate that it fixes two other commits, and indicate "6.1" as
>> stable which does not apply to any of these commits.
>
> Sorry, the situation is indeed a bit confusing.
>
> - In order to make the arch_do_swap_page() hook suitable for fixing the
>   bug introduced by c145e0b47c77, patch 1 addresses a number of issues,
>   including fixing bugs introduced by ca827d55ebaa and 1eba86c096e3,
>   but we haven't fixed the c145e0b47c77 bug yet, so there's no Fixes:
>   tag for it yet.
>
> - Patch 2, relying on the fixes in patch 1, makes MTE install an
>   arch_do_swap_page() hook (indirectly, by making arch_swap_restore()
>   also hook arch_do_swap_page()), thereby fixing the c145e0b47c77 bug.
>

Oh. That's indeed confusing. Maybe that should all be squashed to have
one logical fix for the overall problem. It's especially confusing
because this patch here fixes the other two issues and touches code
moved by c145e0b47c77.

> - 6.1 is the first stable version in which all 3 commits in my Fixes: tags
>   are present, so that is the version that I've indicated in my stable
>   tag for this series. In theory patch 1 could be applied to older kernel
>   versions, but it wouldn't fix any problems that we are facing with MTE
>   (because it only fixes problems relating to the arch_do_swap_page()
>   hook, which older kernel versions don't hook with MTE), and there are
>   some merge conflicts if we go back further anyway. If the SPARC folks
>   (the previous only user of this hook) want to fix these issues with ADI,
>   they can propose their own backport.

Sometimes, it's a good idea to not specify a stable version and rather
let the Fixes: tags imply that.
On Tue, May 16, 2023 at 5:31 AM David Hildenbrand <david@redhat.com> wrote: > > On 15.05.23 19:34, Catalin Marinas wrote: > > On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: > >> On 13.05.23 01:57, Peter Collingbourne wrote: > >>> diff --git a/mm/memory.c b/mm/memory.c > >>> index 01a23ad48a04..83268d287ff1 100644 > >>> --- a/mm/memory.c > >>> +++ b/mm/memory.c > >>> @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>> } > >>> } > >>> - /* > >>> - * Remove the swap entry and conditionally try to free up the swapcache. > >>> - * We're already holding a reference on the page but haven't mapped it > >>> - * yet. > >>> - */ > >>> - swap_free(entry); > >>> - if (should_try_to_free_swap(folio, vma, vmf->flags)) > >>> - folio_free_swap(folio); > >>> - > >>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > >>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > >>> pte = mk_pte(page, vma->vm_page_prot); > >>> - > >>> /* > >>> * Same logic as in do_wp_page(); however, optimize for pages that are > >>> * certainly not shared either because we just allocated them without > >>> @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>> pte = pte_mksoft_dirty(pte); > >>> if (pte_swp_uffd_wp(vmf->orig_pte)) > >>> pte = pte_mkuffd_wp(pte); > >>> + arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > >>> vmf->orig_pte = pte; > >>> + /* > >>> + * Remove the swap entry and conditionally try to free up the swapcache. > >>> + * We're already holding a reference on the page but haven't mapped it > >>> + * yet. > >>> + */ > >>> + swap_free(entry); > >>> + if (should_try_to_free_swap(folio, vma, vmf->flags)) > >>> + folio_free_swap(folio); > >>> + > >>> + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > >>> + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > >>> + > >>> /* ksm created a completely new copy */ > >>> if (unlikely(folio != swapcache && swapcache)) { > >>> page_add_new_anon_rmap(page, vma, vmf->address); > >>> @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>> VM_BUG_ON(!folio_test_anon(folio) || > >>> (pte_write(pte) && !PageAnonExclusive(page))); > >>> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > >>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > >>> folio_unlock(folio); > >>> if (folio != swapcache && swapcache) { > >> > >> > >> You are moving the folio_free_swap() call after the folio_ref_count(folio) > >> == 1 check, which means that such (previously) swapped pages that are > >> exclusive cannot be detected as exclusive. > >> > >> There must be a better way to handle MTE here. > >> > >> Where are the tags stored, how is the location identified, and when are they > >> effectively restored right now? > > > > I haven't gone through Peter's patches yet but a pretty good description > > of the problem is here: > > https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/. > > I couldn't reproduce it with my swap setup but both Qun-wei and Peter > > triggered it. > > > > When a tagged page is swapped out, the arm64 code stores the metadata > > (tags) in a local xarray indexed by the swap pte. When restoring from > > swap, the arm64 set_pte_at() checks this xarray using the old swap pte > > and spills the tags onto the new page. Apparently something changed in > > the kernel recently that causes swap_range_free() to be called before > > set_pte_at(). 
The arm64 arch_swap_invalidate_page() frees the metadata > > from the xarray and the subsequent set_pte_at() won't find it. > > > > If we have the page, the metadata can be restored before set_pte_at() > > and I guess that's what Peter is trying to do (again, I haven't looked > > at the details yet; leaving it for tomorrow). > > Thanks for the details! I was missing that we also have a hook in > swap_range_free(). > > > > > Is there any other way of handling this? E.g. not release the metadata > > in arch_swap_invalidate_page() but later in set_pte_at() once it was > > restored. But then we may leak this metadata if there's no set_pte_at() > > (the process mapping the swap entry died). > > That was my immediate thought: do we really have to hook into > swap_range_free() at all? As I alluded to in another reply, without the hook in swap_range_free() I think we would either end up with a race or an effective memory leak in the arch code that maintains the metadata for swapped out pages, as there would be no way for the arch-specific code to know when it is safe to free it after swapin. > And I also wondered why we have to do this > from set_pte_at() and not do this explicitly (maybe that's the other > arch_* callback on the swapin path). I don't think it's necessary, as the set_pte_at() call sites for swapped in pages are known. I'd much rather do this via an explicit hook at those call sites, as the existing approach of implicit restoring seems too subtle and easy to be overlooked when refactoring, as we have seen with this bug. In the end we only have 3 call sites for the hook and hopefully the comments that I'm adding are sufficient to ensure that any new swapin code should end up with a call to the hook in the right place. Peter
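For context, that hook sits roughly here (abridged from mm/swapfile.c; most of the accounting is elided):

static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
			    unsigned long nr_entries)
{
	unsigned long end = offset + nr_entries - 1;

	/* ... swap accounting on si elided ... */
	while (offset <= end) {
		/*
		 * Last reference to these swap slots is gone: let the arch
		 * drop its per-slot metadata (on arm64, the saved tags via
		 * mte_invalidate_tags()).
		 */
		arch_swap_invalidate_page(si->type, offset);
		offset++;
	}
	/* ... swap cache shadow cleanup elided ... */
}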
On Tue, May 16, 2023 at 5:35 AM David Hildenbrand <david@redhat.com> wrote: > > On 16.05.23 01:40, Peter Collingbourne wrote: > > On Mon, May 15, 2023 at 06:34:30PM +0100, Catalin Marinas wrote: > >> On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: > >>> On 13.05.23 01:57, Peter Collingbourne wrote: > >>>> diff --git a/mm/memory.c b/mm/memory.c > >>>> index 01a23ad48a04..83268d287ff1 100644 > >>>> --- a/mm/memory.c > >>>> +++ b/mm/memory.c > >>>> @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>> } > >>>> } > >>>> - /* > >>>> - * Remove the swap entry and conditionally try to free up the swapcache. > >>>> - * We're already holding a reference on the page but haven't mapped it > >>>> - * yet. > >>>> - */ > >>>> - swap_free(entry); > >>>> - if (should_try_to_free_swap(folio, vma, vmf->flags)) > >>>> - folio_free_swap(folio); > >>>> - > >>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > >>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > >>>> pte = mk_pte(page, vma->vm_page_prot); > >>>> - > >>>> /* > >>>> * Same logic as in do_wp_page(); however, optimize for pages that are > >>>> * certainly not shared either because we just allocated them without > >>>> @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>> pte = pte_mksoft_dirty(pte); > >>>> if (pte_swp_uffd_wp(vmf->orig_pte)) > >>>> pte = pte_mkuffd_wp(pte); > >>>> + arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > >>>> vmf->orig_pte = pte; > >>>> + /* > >>>> + * Remove the swap entry and conditionally try to free up the swapcache. > >>>> + * We're already holding a reference on the page but haven't mapped it > >>>> + * yet. > >>>> + */ > >>>> + swap_free(entry); > >>>> + if (should_try_to_free_swap(folio, vma, vmf->flags)) > >>>> + folio_free_swap(folio); > >>>> + > >>>> + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > >>>> + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > >>>> + > >>>> /* ksm created a completely new copy */ > >>>> if (unlikely(folio != swapcache && swapcache)) { > >>>> page_add_new_anon_rmap(page, vma, vmf->address); > >>>> @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>> VM_BUG_ON(!folio_test_anon(folio) || > >>>> (pte_write(pte) && !PageAnonExclusive(page))); > >>>> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > >>>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > >>>> folio_unlock(folio); > >>>> if (folio != swapcache && swapcache) { > >>> > >>> > >>> You are moving the folio_free_swap() call after the folio_ref_count(folio) > >>> == 1 check, which means that such (previously) swapped pages that are > >>> exclusive cannot be detected as exclusive. > >>> > >>> There must be a better way to handle MTE here. > >>> > >>> Where are the tags stored, how is the location identified, and when are they > >>> effectively restored right now? > >> > >> I haven't gone through Peter's patches yet but a pretty good description > >> of the problem is here: > >> https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/. > >> I couldn't reproduce it with my swap setup but both Qun-wei and Peter > >> triggered it. > > > > In order to reproduce this bug it is necessary for the swap slot cache > > to be disabled, which is unlikely to occur during normal operation. 
I
> > was only able to reproduce the bug by disabling it forcefully with the
> > following patch:
> >
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e0..25afba16980c7 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -79,7 +79,7 @@ void disable_swap_slots_cache_lock(void)
> >
> >  static void __reenable_swap_slots_cache(void)
> >  {
> > -	swap_slot_cache_enabled = has_usable_swap();
> > +	swap_slot_cache_enabled = false;
> >  }
> >
> >  void reenable_swap_slots_cache_unlock(void)
> >
> > With that I can trigger the bug on an MTE-utilizing process by running
> > a program that enumerates the process's private anonymous mappings and
> > calls process_madvise(MADV_PAGEOUT) on all of them.
> >
> >> When a tagged page is swapped out, the arm64 code stores the metadata
> >> (tags) in a local xarray indexed by the swap pte. When restoring from
> >> swap, the arm64 set_pte_at() checks this xarray using the old swap pte
> >> and spills the tags onto the new page. Apparently something changed in
> >> the kernel recently that causes swap_range_free() to be called before
> >> set_pte_at(). The arm64 arch_swap_invalidate_page() frees the metadata
> >> from the xarray and the subsequent set_pte_at() won't find it.
> >>
> >> If we have the page, the metadata can be restored before set_pte_at()
> >> and I guess that's what Peter is trying to do (again, I haven't looked
> >> at the details yet; leaving it for tomorrow).
> >>
> >> Is there any other way of handling this? E.g. not release the metadata
> >> in arch_swap_invalidate_page() but later in set_pte_at() once it was
> >> restored. But then we may leak this metadata if there's no set_pte_at()
> >> (the process mapping the swap entry died).
> >
> > Another problem that I can see with this approach is that it does not
> > respect reference counts for swap entries, and it's unclear whether that
> > can be done in a non-racy fashion.
> >
> > Another approach that I considered was to move the hook to swap_readpage()
> > as in the patch below (sorry, it only applies to an older version
> > of Android's android14-6.1 branch and not mainline, but you get the
> > idea). But during a stress test (running the aforementioned program that
> > calls process_madvise(MADV_PAGEOUT) in a loop during an Android "monkey"
> > test) I discovered the following racy use-after-free that can occur when
> > two tasks T1 and T2 concurrently restore the same page:
> >
> > T1:                  | T2:
> > arch_swap_readpage() |
> >                      | arch_swap_readpage() -> mte_restore_tags() -> xa_load()
> > swap_free()          |
> >                      | arch_swap_readpage() -> mte_restore_tags() -> mte_restore_page_tags()
> >
> > We can avoid it by taking the swap_info_struct::lock spinlock in
> > mte_restore_tags(), but it seems like it would lead to lock contention.
>
> Would the idea be to fail swap_readpage() on the one that comes last,
> simply retrying to lookup the page?

The idea would be that T2's arch_swap_readpage() could potentially not
find tags if it ran after swap_free(), so T2 would produce a page
without restored tags. But that wouldn't matter, because T1 reaching
swap_free() means that T2 will follow the goto at [1] after waiting
for T1 to unlock at [2], and T2's page will be discarded.

> This might be a naive question, but how does MTE play along with shared
> anonymous pages?

It should work fine. shmem_writepage() calls swap_writepage() which
calls arch_prepare_to_swap() to write the tags. And
shmem_swapin_folio() has a call to arch_swap_restore() to restore
them.
Peter

[1] https://github.com/torvalds/linux/blob/f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6/mm/memory.c#L3881
[2] https://github.com/torvalds/linux/blob/f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6/mm/memory.c#L4006
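For reference, the shmem pairing described above looks roughly like this in mainline (abridged excerpts; see mm/page_io.c and mm/shmem.c for the full functions):

/* Swap-out (mm/page_io.c, abridged): preserve tags before the write. */
int swap_writepage(struct page *page, struct writeback_control *wbc)
{
	/* ... */
	ret = arch_prepare_to_swap(page);	/* arm64: mte_save_tags() */
	/* ... submit the swap write ... */
}

/* Swap-in (mm/shmem.c, abridged): restore tags after the read, before
 * the folio is mapped. */
static int shmem_swapin_folio(/* ... */)
{
	/* ... */
	folio_wait_writeback(folio);
	/*
	 * Some architectures may have to restore extra metadata to the
	 * folio after reading from swap.
	 */
	arch_swap_restore(swap, folio);
	/* ... */
}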
On Tue, May 16, 2023 at 5:40 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.05.23 02:16, Peter Collingbourne wrote:
> > On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote:
> >> On 13.05.23 01:57, Peter Collingbourne wrote:
> >>> Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
> >>> the call to swap_free() before the call to set_pte_at(), which meant that
> >>> the MTE tags could end up being freed before set_pte_at() had a chance
> >>> to restore them. One other possibility was to hook arch_do_swap_page(),
> >>> but this had a number of problems:
> >>>
> >>> - The call to the hook was also after swap_free().
> >>>
> >>> - The call to the hook was after the call to set_pte_at(), so there was a
> >>>   racy window where uninitialized metadata may be exposed to userspace.
> >>>   This likely also affects SPARC ADI, which implements this hook to
> >>>   restore tags.
> >>>
> >>> - As a result of commit 1eba86c096e3 ("mm: change page type prior to
> >>>   adding page table entry"), we were also passing the new PTE as the
> >>>   oldpte argument, preventing the hook from knowing the swap index.
> >>>
> >>> Fix all of these problems by moving the arch_do_swap_page() call before
> >>> the call to free_page(), and ensuring that we do not set orig_pte until
> >>> after the call.
> >>>
> >>> Signed-off-by: Peter Collingbourne <pcc@google.com>
> >>> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> >>> Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965
> >>> Cc: <stable@vger.kernel.org> # 6.1
> >>> Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
> >>> Fixes: 1eba86c096e3 ("mm: change page type prior to adding page table entry")
> >>
> >> I'm confused. You say c145e0b47c77 changed something (which was after above
> >> commits), indicate that it fixes two other commits, and indicate "6.1" as
> >> stable which does not apply to any of these commits.
> >
> > Sorry, the situation is indeed a bit confusing.
> >
> > - In order to make the arch_do_swap_page() hook suitable for fixing the
> >   bug introduced by c145e0b47c77, patch 1 addresses a number of issues,
> >   including fixing bugs introduced by ca827d55ebaa and 1eba86c096e3,
> >   but we haven't fixed the c145e0b47c77 bug yet, so there's no Fixes:
> >   tag for it yet.
> >
> > - Patch 2, relying on the fixes in patch 1, makes MTE install an
> >   arch_do_swap_page() hook (indirectly, by making arch_swap_restore()
> >   also hook arch_do_swap_page()), thereby fixing the c145e0b47c77 bug.
>
> Oh. That's indeed confusing. Maybe that should all be squashed to have
> one logical fix for the overall problem. It's especially confusing
> because this patch here fixes the other two issues and touches code
> moved by c145e0b47c77.

Maybe. It can sometimes be hard to reconcile "one logical change per
patch" with "bug requires more than one logical change to fix" though.
Fortunately in this case I think we have an approach that fixes the bug
in one logical change, with some followup patches to clean things up.

> > - 6.1 is the first stable version in which all 3 commits in my Fixes: tags
> >   are present, so that is the version that I've indicated in my stable
> >   tag for this series.
In theory patch 1 could be applied to older kernel > > versions, but it wouldn't fix any problems that we are facing with MTE > > (because it only fixes problems relating to the arch_do_swap_page() > > hook, which older kernel versions don't hook with MTE), and there are > > some merge conflicts if we go back further anyway. If the SPARC folks > > (the previous only user of this hook) want to fix these issues with ADI, > > they can propose their own backport. > > Sometimes, it's a good idea to not specify a stable version and rather > let the Fixes: tags imply that. Yeah, but sometimes it's hard to say which way would be more efficient. Either we spend time discussing why the version is necessary or Greg spends time trying to apply patches to the wrong trees because I wasn't more explicit... Peter
>> Would the idea be to fail swap_readpage() on the one that comes last, >> simply retrying to lookup the page? > > The idea would be that T2's arch_swap_readpage() could potentially not > find tags if it ran after swap_free(), so T2 would produce a page > without restored tags. But that wouldn't matter, because T1 reaching > swap_free() means that T2 will follow the goto at [1] after waiting > for T1 to unlock at [2], and T2's page will be discarded. Ah, right. > >> This might be a naive question, but how does MTE play along with shared >> anonymous pages? > > It should work fine. shmem_writepage() calls swap_writepage() which > calls arch_prepare_to_swap() to write the tags. And > shmem_swapin_folio() has a call to arch_swap_restore() to restore > them. Sorry, I meant actual anonymous memory pages, not shmem. Like, anonymous pages that are COW-shared due to fork() or KSM. How does MTE, in general, interact with that? Assume one process ends up modifying the tags ... and the page is COW-shared with a different process that should not observe these tag modifications.
>>> Is there any other way of handling this? E.g. not release the metadata >>> in arch_swap_invalidate_page() but later in set_pte_at() once it was >>> restored. But then we may leak this metadata if there's no set_pte_at() >>> (the process mapping the swap entry died). >> >> That was my immediate thought: do we really have to hook into >> swap_range_free() at all? > > As I alluded to in another reply, without the hook in > swap_range_free() I think we would either end up with a race or an > effective memory leak in the arch code that maintains the metadata for > swapped out pages, as there would be no way for the arch-specific code > to know when it is safe to free it after swapin. Agreed, hooking swap_range_free() is actually cleaner (also considering COW-shared pages). > >> And I also wondered why we have to do this >> from set_pte_at() and not do this explicitly (maybe that's the other >> arch_* callback on the swapin path). > > I don't think it's necessary, as the set_pte_at() call sites for > swapped in pages are known. I'd much rather do this via an explicit > hook at those call sites, as the existing approach of implicit > restoring seems too subtle and easy to be overlooked when refactoring, > as we have seen with this bug. In the end we only have 3 call sites > for the hook and hopefully the comments that I'm adding are sufficient > to ensure that any new swapin code should end up with a call to the > hook in the right place. Agreed, much cleaner, thanks!
On 16.05.23 04:35, Peter Collingbourne wrote:
> On Mon, May 15, 2023 at 05:16:09PM -0700, Peter Collingbourne wrote:
>> On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote:
>>> On 13.05.23 01:57, Peter Collingbourne wrote:
>>>> Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
>>>> the call to swap_free() before the call to set_pte_at(), which meant that
>>>> the MTE tags could end up being freed before set_pte_at() had a chance
>>>> to restore them. One other possibility was to hook arch_do_swap_page(),
>>>> but this had a number of problems:
>>>>
>>>> - The call to the hook was also after swap_free().
>>>>
>>>> - The call to the hook was after the call to set_pte_at(), so there was a
>>>>   racy window where uninitialized metadata may be exposed to userspace.
>>>>   This likely also affects SPARC ADI, which implements this hook to
>>>>   restore tags.
>>>>
>>>> - As a result of commit 1eba86c096e3 ("mm: change page type prior to
>>>>   adding page table entry"), we were also passing the new PTE as the
>>>>   oldpte argument, preventing the hook from knowing the swap index.
>>>>
>>>> Fix all of these problems by moving the arch_do_swap_page() call before
>>>> the call to free_page(), and ensuring that we do not set orig_pte until
>>>> after the call.
>>>>
>>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>>> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
>>>> Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965
>>>> Cc: <stable@vger.kernel.org> # 6.1
>>>> Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
>>>> Fixes: 1eba86c096e3 ("mm: change page type prior to adding page table entry")
>>>
>>> I'm confused. You say c145e0b47c77 changed something (which was after above
>>> commits), indicate that it fixes two other commits, and indicate "6.1" as
>>> stable which does not apply to any of these commits.
>>
>> Sorry, the situation is indeed a bit confusing.
>>
>> - In order to make the arch_do_swap_page() hook suitable for fixing the
>>   bug introduced by c145e0b47c77, patch 1 addresses a number of issues,
>>   including fixing bugs introduced by ca827d55ebaa and 1eba86c096e3,
>>   but we haven't fixed the c145e0b47c77 bug yet, so there's no Fixes:
>>   tag for it yet.
>>
>> - Patch 2, relying on the fixes in patch 1, makes MTE install an
>>   arch_do_swap_page() hook (indirectly, by making arch_swap_restore()
>>   also hook arch_do_swap_page()), thereby fixing the c145e0b47c77 bug.
>>
>> - 6.1 is the first stable version in which all 3 commits in my Fixes: tags
>>   are present, so that is the version that I've indicated in my stable
>>   tag for this series. In theory patch 1 could be applied to older kernel
>>   versions, but it wouldn't fix any problems that we are facing with MTE
>>   (because it only fixes problems relating to the arch_do_swap_page()
>>   hook, which older kernel versions don't hook with MTE), and there are
>>   some merge conflicts if we go back further anyway. If the SPARC folks
>>   (previously the only user of this hook) want to fix these issues with
>>   ADI, they can propose their own backport.
>>
>>>> @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	VM_BUG_ON(!folio_test_anon(folio) ||
>>>>  			(pte_write(pte) && !PageAnonExclusive(page)));
>>>>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>>> -	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>>>>  	folio_unlock(folio);
>>>>  	if (folio != swapcache && swapcache) {
>>>
>>> You are moving the folio_free_swap() call after the folio_ref_count(folio)
>>> == 1 check, which means that such (previously) swapped pages that are
>>> exclusive cannot be detected as exclusive.
>>
>> Ack. I will fix this in v2.
>
> I gave this some thought and concluded that the added complexity needed
> to make this hook suitable for arm64 without breaking sparc probably
> isn't worth it in the end, and as I explained in patch 2, sparc ought
> to be moving away from this hook anyway. So in v2 I replaced patches 1
> and 2 with a patch that adds a direct call to the arch_swap_restore()
> hook before the call to swap_free().

As a side note, I recall that sparc code might be a bit fragile and
eventually broken already (arch_unmap_one()):

https://lkml.kernel.org/r/d98bd1f9-e9b7-049c-7bde-3348b074eb18@redhat.com
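[Concretely, the v2 shape Peter describes is a small change to
do_swap_page(): restore the metadata while the swap entry is still
valid, then free the entry. A sketch of the relevant fragment; this
mirrors what eventually landed upstream, though the comment wording
here is illustrative:

	/*
	 * Some architectures may have to restore extra metadata to the
	 * page when reading from swap. The metadata may be indexed by
	 * the swap entry, so the restore must happen before swap_free()
	 * can release the backing storage for that entry.
	 */
	arch_swap_restore(entry, folio);

	/*
	 * Remove the swap entry and conditionally try to free up the
	 * swapcache, as before.
	 */
	swap_free(entry);
	if (should_try_to_free_swap(folio, vma, vmf->flags))
		folio_free_swap(folio);
]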
On Wed, May 17, 2023 at 1:30 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> Would the idea be to fail swap_readpage() on the one that comes last,
> >> simply retrying to lookup the page?
> >
> > The idea would be that T2's arch_swap_readpage() could potentially not
> > find tags if it ran after swap_free(), so T2 would produce a page
> > without restored tags. But that wouldn't matter, because T1 reaching
> > swap_free() means that T2 will follow the goto at [1] after waiting
> > for T1 to unlock at [2], and T2's page will be discarded.
>
> Ah, right.
>
> >
> >> This might be a naive question, but how does MTE play along with shared
> >> anonymous pages?
> >
> > It should work fine. shmem_writepage() calls swap_writepage() which
> > calls arch_prepare_to_swap() to write the tags. And
> > shmem_swapin_folio() has a call to arch_swap_restore() to restore
> > them.
>
> Sorry, I meant actual anonymous memory pages, not shmem. Like, anonymous
> pages that are COW-shared due to fork() or KSM.
>
> How does MTE, in general, interact with that? Assume one process ends up
> modifying the tags ... and the page is COW-shared with a different
> process that should not observe these tag modifications.

Tag modifications cause write faults if the page is read-only, so for
COW shared pages we would end up copying the page in the usual way,
which on arm64 would copy the tags as well via the copy_highpage hook
(see arch/arm64/mm/copypage.c).

Peter
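[A condensed sketch of the hook Peter points at, simplified from
arch/arm64/mm/copypage.c: details such as KASAN tag handling are
elided, and the exact helper names may differ across kernel versions.
COW resolution copies data and MTE tags together, so the process that
triggered the write fault never leaks its tag updates to the other
sharer:

void copy_highpage(struct page *to, struct page *from)
{
	void *kto = page_address(to);
	void *kfrom = page_address(from);

	copy_page(kto, kfrom);		/* page data */

	if (system_supports_mte() && page_mte_tagged(from)) {
		mte_copy_page_tags(kto, kfrom);	/* MTE tags */
		set_page_mte_tagged(to);
	}
}
]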
>> Sorry, I meant actual anonymous memory pages, not shmem. Like, anonymous
>> pages that are COW-shared due to fork() or KSM.
>>
>> How does MTE, in general, interact with that? Assume one process ends up
>> modifying the tags ... and the page is COW-shared with a different
>> process that should not observe these tag modifications.
>
> Tag modifications cause write faults if the page is read-only, so for
> COW shared pages we would end up copying the page in the usual way,
> which on arm64 would copy the tags as well via the copy_highpage hook
> (see arch/arm64/mm/copypage.c).

Oh, that makes sense, thanks for pointing that out!

... and I can spot that KSM also checks the tag when de-duplicating:
pages_identical() ends up calling memcmp_pages(), which knows how to deal
with tags.

Interestingly, calc_checksum() does not seem to care about tags. But that
simply implies that pages with the same content have the same checksum,
independent of the tag. And pages_identical() is the single source of truth.
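[A condensed sketch of the arm64 memcmp_pages() override being
discussed (see arch/arm64/kernel/mte.c); simplified, not the exact
upstream code. Pages whose data bytes match are still reported as
different when MTE tags are in play, so KSM declines to merge them and
per-page tags are preserved:

int memcmp_pages(struct page *page1, struct page *page2)
{
	char *addr1 = page_address(page1);
	char *addr2 = page_address(page2);
	int ret;

	ret = memcmp(addr1, addr2, PAGE_SIZE);
	if (!system_supports_mte() || ret)
		return ret;

	/*
	 * The data is identical, but if at least one of the pages is
	 * tagged, report a difference so KSM does not merge them and
	 * potentially mix up the tags.
	 */
	if (page_mte_tagged(page1) || page_mte_tagged(page2))
		return 1;

	return ret;
}
]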
On Fri, May 19, 2023 at 11:21:35AM +0200, David Hildenbrand wrote:
> > > Sorry, I meant actual anonymous memory pages, not shmem. Like, anonymous
> > > pages that are COW-shared due to fork() or KSM.
> > >
> > > How does MTE, in general, interact with that? Assume one process ends up
> > > modifying the tags ... and the page is COW-shared with a different
> > > process that should not observe these tag modifications.
> >
> > Tag modifications cause write faults if the page is read-only, so for
> > COW shared pages we would end up copying the page in the usual way,
> > which on arm64 would copy the tags as well via the copy_highpage hook
> > (see arch/arm64/mm/copypage.c).
>
> Oh, that makes sense, thanks for pointing that out!
>
> ... and I can spot that KSM also checks the tag when de-duplicating:
> pages_identical() ends up calling memcmp_pages(), which knows how to deal
> with tags.
>
> Interestingly, calc_checksum() does not seem to care about tags. But that
> simply implies that pages with the same content have the same checksum,
> independent of the tag. And pages_identical() is the single source of truth.

That was my assumption at the time: there would be a memcmp_pages() call
in case of a checksum collision.