Message ID | 20240403114032.1162100-3-ryan.roberts@arm.com (mailing list archive)
---|---
State | New
Series | Swap-out mTHP without splitting
Andrew, Looks like there are a few niggles for me to address below. So please don't let v6 (currently in mm-unstable) progress to mm-stable. I'll aim to get a new version out (or patch for this depending on how discussion lands) in the next couple of days. On 05/04/2024 11:13, David Hildenbrand wrote: > On 03.04.24 13:40, Ryan Roberts wrote: >> Now that we no longer have a convenient flag in the cluster to determine >> if a folio is large, free_swap_and_cache() will take a reference and >> lock a large folio much more often, which could lead to contention and >> (e.g.) failure to split large folios, etc. >> >> Let's solve that problem by batch freeing swap and cache with a new >> function, free_swap_and_cache_nr(), to free a contiguous range of swap >> entries together. This allows us to first drop a reference to each swap >> slot before we try to release the cache folio. This means we only try to >> release the folio once, only taking the reference and lock once - much >> better than the previous 512 times for the 2M THP case. >> >> Contiguous swap entries are gathered in zap_pte_range() and >> madvise_free_pte_range() in a similar way to how present ptes are >> already gathered in zap_pte_range(). >> >> While we are at it, let's simplify by converting the return type of both >> functions to void. The return value was used only by zap_pte_range() to >> print a bad pte, and was ignored by everyone else, so the extra >> reporting wasn't exactly guaranteed. We will still get the warning with >> most of the information from get_swap_device(). With the batch version, >> we wouldn't know which pte was bad anyway so could print the wrong one. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> include/linux/pgtable.h | 28 ++++++++++++++ >> include/linux/swap.h | 12 ++++-- >> mm/internal.h | 48 +++++++++++++++++++++++ >> mm/madvise.c | 12 ++++-- >> mm/memory.c | 13 ++++--- >> mm/swapfile.c | 86 ++++++++++++++++++++++++++++++++--------- >> 6 files changed, 167 insertions(+), 32 deletions(-) >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >> index a3fc8150b047..0278259f7078 100644 >> --- a/include/linux/pgtable.h >> +++ b/include/linux/pgtable.h >> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct >> mm_struct *mm, >> } >> #endif >> +#ifndef clear_not_present_full_ptes >> +/** >> + * clear_not_present_full_ptes - Clear consecutive not present PTEs. > > > Consecutive only in the page table or also in some other sense? > > I suspect: just unrelated non-present entries of any kind (swp, nonswp) and any > offset/pfn. yes. > > Consider document that. How about: "Clear multiple not present PTEs which are consecutive in the pgtable"? > >> + * @mm: Address space the ptes represent. >> + * @addr: Address of the first pte. >> + * @ptep: Page table pointer for the first entry. >> + * @nr: Number of entries to clear. >> + * @full: Whether we are clearing a full mm. >> + * >> + * May be overridden by the architecture; otherwise, implemented as a simple >> + * loop over pte_clear_not_present_full(). >> + * >> + * Context: The caller holds the page table lock. The PTEs are all not present. >> + * The PTEs are all in the same PMD. 
>> + */ >> +static inline void clear_not_present_full_ptes(struct mm_struct *mm, >> + unsigned long addr, pte_t *ptep, unsigned int nr, int full) >> +{ >> + for (;;) { >> + pte_clear_not_present_full(mm, addr, ptep, full); >> + if (--nr == 0) >> + break; >> + ptep++; >> + addr += PAGE_SIZE; >> + } >> +} >> +#endif >> + >> #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH >> extern pte_t ptep_clear_flush(struct vm_area_struct *vma, >> unsigned long address, >> diff --git a/include/linux/swap.h b/include/linux/swap.h >> index f6f78198f000..5737236dc3ce 100644 >> --- a/include/linux/swap.h >> +++ b/include/linux/swap.h >> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t); >> extern int swapcache_prepare(swp_entry_t); >> extern void swap_free(swp_entry_t); >> extern void swapcache_free_entries(swp_entry_t *entries, int n); >> -extern int free_swap_and_cache(swp_entry_t); >> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); >> int swap_type_of(dev_t device, sector_t offset); >> int find_first_swap(dev_t *device); >> extern unsigned int count_swap_pages(int, int); >> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct >> *si) >> #define free_pages_and_swap_cache(pages, nr) \ >> release_pages((pages), (nr)); >> > > > [...] > >> + >> +/** >> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries >> + * @start_ptep: Page table pointer for the first entry. >> + * @max_nr: The maximum number of table entries to consider. >> + * @entry: Swap entry recovered from the first table entry. >> + * >> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs >> + * containing swap entries all with consecutive offsets and targeting the same >> + * swap type. >> + * > > Likely you should document that any swp pte bits are ignored? () Sorry I don't understand this comment. I thought any non-none, non-present PTE was always considered to contain only a "swap entry" and a swap entry consists of a "type" and an "offset" only. (and its a special "non-swap" swap entry if type > SOME_CONSTANT) Are you saying there are additional fields in the PTE that are not part of the swap entry? > >> + * max_nr must be at least one and must be limited by the caller so scanning >> + * cannot exceed a single page table. >> + * >> + * Return: the number of table entries in the batch. >> + */ >> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, >> + swp_entry_t entry) >> +{ >> + const pte_t *end_ptep = start_ptep + max_nr; >> + unsigned long expected_offset = swp_offset(entry) + 1; >> + unsigned int expected_type = swp_type(entry); >> + pte_t *ptep = start_ptep + 1; >> + >> + VM_WARN_ON(max_nr < 1); >> + VM_WARN_ON(non_swap_entry(entry)); >> + >> + while (ptep < end_ptep) { >> + pte_t pte = ptep_get(ptep); >> + >> + if (pte_none(pte) || pte_present(pte)) >> + break; >> + >> + entry = pte_to_swp_entry(pte); >> + >> + if (non_swap_entry(entry) || >> + swp_type(entry) != expected_type || >> + swp_offset(entry) != expected_offset) >> + break; >> + >> + expected_offset++; >> + ptep++; >> + } >> + >> + return ptep - start_ptep; >> +} > > Looks very clean :) > > I was wondering whether we could similarly construct the expected swp PTE and > only check pte_same. > > expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset)); > > ... or have a variant to increase only the swp offset for an existing pte. But > non-trivial due to the arch-dependent format. > > But then, we'd fail on mismatch of other swp pte bits. 
Hmm, perhaps I have a misunderstanding regarding "swp pte bits"... > > > On swapin, when reusing this function (likely!), we'll might to make sure that > the PTE bits match as well. > > See below regarding uffd-wp. > > >> #endif /* CONFIG_MMU */ >> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio, >> diff --git a/mm/madvise.c b/mm/madvise.c >> index 1f77a51baaac..070bedb4996e 100644 >> --- a/mm/madvise.c >> +++ b/mm/madvise.c >> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >> long addr, >> struct folio *folio; >> int nr_swap = 0; >> unsigned long next; >> + int nr, max_nr; >> next = pmd_addr_end(addr, end); >> if (pmd_trans_huge(*pmd)) >> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >> long addr, >> return 0; >> flush_tlb_batched_pending(mm); >> arch_enter_lazy_mmu_mode(); >> - for (; addr != end; pte++, addr += PAGE_SIZE) { >> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { >> + nr = 1; >> ptent = ptep_get(pte); >> if (pte_none(ptent)) >> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >> long addr, >> entry = pte_to_swp_entry(ptent); >> if (!non_swap_entry(entry)) { >> - nr_swap--; >> - free_swap_and_cache(entry); >> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >> + max_nr = (end - addr) / PAGE_SIZE; >> + nr = swap_pte_batch(pte, max_nr, entry); >> + nr_swap -= nr; >> + free_swap_and_cache_nr(entry, nr); >> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >> } else if (is_hwpoison_entry(entry) || >> is_poisoned_swp_entry(entry)) { >> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >> diff --git a/mm/memory.c b/mm/memory.c >> index 7dc6c3d9fa83..ef2968894718 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather >> *tlb, >> folio_remove_rmap_pte(folio, page, vma); >> folio_put(folio); >> } else if (!non_swap_entry(entry)) { >> - /* Genuine swap entry, hence a private anon page */ >> + max_nr = (end - addr) / PAGE_SIZE; >> + nr = swap_pte_batch(pte, max_nr, entry); >> + /* Genuine swap entries, hence a private anon pages */ >> if (!should_zap_cows(details)) >> continue; >> - rss[MM_SWAPENTS]--; >> - if (unlikely(!free_swap_and_cache(entry))) >> - print_bad_pte(vma, addr, ptent, NULL); >> + rss[MM_SWAPENTS] -= nr; >> + free_swap_and_cache_nr(entry, nr); >> } else if (is_migration_entry(entry)) { >> folio = pfn_swap_entry_folio(entry); >> if (!should_zap_folio(details, folio)) >> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, >> pr_alert("unrecognized swap entry 0x%lx\n", entry.val); >> WARN_ON_ONCE(1); >> } >> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); >> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); > > For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match. > > zap_install_uffd_wp_if_needed() will use the uffd-wp information in > ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers. > > On mixture, you either lose some or place too many markers. What path are you concerned about here? I don't get how what you describe can happen? swap_pte_batch() will only give me a batch of actual swap entries and actual swap entries don't contain uffd-wp info, IIUC. If the function gets to a "non-swap" swap entry, it bails. I thought the uffd-wp info was populated based on the VMA state at swap-in? 
I think you are telling me that it's persisted across the swap per-pte? > > A simple workaround would be to disable any such batching if the VMA does have > uffd-wp enabled. > >> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); >> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); >> add_mm_rss_vec(mm, rss); >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index 0d44ee2b4f9c..d059de6896c1 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent) >> /* Reclaim the swap entry if swap is getting full*/ >> #define TTRS_FULL 0x4 >> -/* returns 1 if swap entry is freed */ >> +/* >> + * returns number of pages in the folio that backs the swap entry. If positive, >> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no >> + * folio was associated with the swap entry. >> + */ >> static int __try_to_reclaim_swap(struct swap_info_struct *si, >> unsigned long offset, unsigned long flags) >> { >> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, >> ret = folio_free_swap(folio); >> folio_unlock(folio); >> } >> + ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio); >> folio_put(folio); >> return ret; >> } >> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); >> spin_lock(&si->lock); >> /* entry was freed successfully, try to use this again */ >> - if (swap_was_freed) >> + if (swap_was_freed > 0) >> goto checks; >> goto scan; /* check next one */ >> } >> @@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio) >> return true; >> } >> -/* >> - * Free the swap entry like above, but also try to >> - * free the page cache entry if it is the last user. >> - */ >> -int free_swap_and_cache(swp_entry_t entry) > > Can we have some documentation what this function expects? How does nr relate to > entry? > > i.e., offset range defined by [entry.offset, entry.offset + nr). Yep, will add something along these lines. > >> +void free_swap_and_cache_nr(swp_entry_t entry, int nr) >> { >> - struct swap_info_struct *p; > > It might be easier to get if you do > > const unsigned long start_offset = swp_offset(entry); > const unsigned long end_offset = start_offset + nr; OK will do this. > >> + unsigned long end = swp_offset(entry) + nr; >> + unsigned int type = swp_type(entry); >> + struct swap_info_struct *si; >> + bool any_only_cache = false; >> + unsigned long offset; >> unsigned char count; >> if (non_swap_entry(entry)) >> - return 1; >> + return; >> - p = get_swap_device(entry); >> - if (p) { >> - if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) { >> - put_swap_device(p); >> - return 0; >> + si = get_swap_device(entry); >> + if (!si) >> + return; >> + >> + if (WARN_ON(end > si->max)) >> + goto out; >> + >> + /* >> + * First free all entries in the range. >> + */ >> + for (offset = swp_offset(entry); offset < end; offset++) { >> + if (!WARN_ON(data_race(!si->swap_map[offset]))) { > > Ouch "!(!)). Confusing. > > I'm sure there is a better way to write that, maybe using more lines > > if (data_race(si->swap_map[offset])) { > ... > } else { > WARN_ON_ONCE(1); > } OK, I thought it was kinda neat myself. I'll change it. 
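For illustration, the restructured check might look something like the sketch below, which simply combines the if/else shape suggested above with the loop body from the patch (not necessarily what the respin will use):

for (offset = swp_offset(entry); offset < end; offset++) {
	if (data_race(si->swap_map[offset])) {
		count = __swap_entry_free(si, swp_entry(type, offset));
		if (count == SWAP_HAS_CACHE)
			any_only_cache = true;
	} else {
		/* The caller should hold a reference, so 0 is unexpected. */
		WARN_ON_ONCE(1);
	}
}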
> > >> + count = __swap_entry_free(si, swp_entry(type, offset)); >> + if (count == SWAP_HAS_CACHE) >> + any_only_cache = true; >> } >> + } >> + >> + /* >> + * Short-circuit the below loop if none of the entries had their >> + * reference drop to zero. >> + */ >> + if (!any_only_cache) >> + goto out; >> - count = __swap_entry_free(p, entry); >> - if (count == SWAP_HAS_CACHE) >> - __try_to_reclaim_swap(p, swp_offset(entry), >> + /* >> + * Now go back over the range trying to reclaim the swap cache. This is >> + * more efficient for large folios because we will only try to reclaim >> + * the swap once per folio in the common case. If we do >> + * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the >> + * latter will get a reference and lock the folio for every individual >> + * page but will only succeed once the swap slot for every subpage is >> + * zero. >> + */ >> + for (offset = swp_offset(entry); offset < end; offset += nr) { >> + nr = 1; >> + if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { > > Here we use READ_ONCE() only, above data_race(). Hmmm. Yes. I think this is correct. READ_ONCE() is a "marked access" which KCSAN understands, so it won't complain about it. So data_race() isn't required when READ_ONCE() (or WRITE_ONCE()) is used. I believe READ_ONCE() is required here because we don't have a lock and we want to make sure we read it in a non-tearing manner. We don't need the READ_ONCE() above since we don't care about the exact value - only that it's not 0 (because we should be holding a ref). So do a plain access to give the compiler a bit more freedom. But we need to mark that with data_race() to stop KCSAN from complaining. > >> + /* >> + * Folios are always naturally aligned in swap so >> + * advance forward to the next boundary. Zero means no >> + * folio was found for the swap entry, so advance by 1 >> + * in this case. Negative value means folio was found >> + * but could not be reclaimed. Here we can still advance >> + * to the next boundary. >> + */ >> + nr = __try_to_reclaim_swap(si, offset, >> TTRS_UNMAPPED | TTRS_FULL); >> - put_swap_device(p); >> + if (nr == 0) >> + nr = 1; >> + else if (nr < 0) >> + nr = -nr; >> + nr = ALIGN(offset + 1, nr) - offset; >> + } > > Apart from that nothing jumped at me. Thanks for the review!
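As a cross-check of the folio-boundary advance discussed above, here is a standalone sketch with hypothetical numbers (the kernel's ALIGN() semantics are re-created locally; this is not kernel code):

#include <stdio.h>

/* Round x up to a multiple of a (a is a power of two), like the kernel's ALIGN(). */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

int main(void)
{
	/*
	 * Hypothetical case: a 16-page folio sits at the naturally aligned
	 * swap offsets 512..527 and the scan is currently at offset 515.
	 */
	unsigned long offset = 515;
	long nr = 16;	/* pages in the folio found at this offset */

	nr = ALIGN(offset + 1, nr) - offset;
	printf("advance by %ld to offset %lu\n", nr, offset + nr);	/* 13 -> 528 */
	return 0;
}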
>>> + >>> +/** >>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries >>> + * @start_ptep: Page table pointer for the first entry. >>> + * @max_nr: The maximum number of table entries to consider. >>> + * @entry: Swap entry recovered from the first table entry. >>> + * >>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs >>> + * containing swap entries all with consecutive offsets and targeting the same >>> + * swap type. >>> + * >> >> Likely you should document that any swp pte bits are ignored? () > > Sorry I don't understand this comment. I thought any non-none, non-present PTE > was always considered to contain only a "swap entry" and a swap entry consists > of a "type" and an "offset" only. (and its a special "non-swap" swap entry if > type > SOME_CONSTANT) Are you saying there are additional fields in the PTE that > are not part of the swap entry? pte_swp_soft_dirty() pte_swp_clear_exclusive() pte_swp_uffd_wp() Are PTE bits used for swp PTE. There is also dirty/young for migration entries, but that's not of a concern here, because we stop for non_swap_entry(). > > >> >>> + * max_nr must be at least one and must be limited by the caller so scanning >>> + * cannot exceed a single page table. >>> + * >>> + * Return: the number of table entries in the batch. >>> + */ >>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, >>> + swp_entry_t entry) >>> +{ >>> + const pte_t *end_ptep = start_ptep + max_nr; >>> + unsigned long expected_offset = swp_offset(entry) + 1; >>> + unsigned int expected_type = swp_type(entry); >>> + pte_t *ptep = start_ptep + 1; >>> + >>> + VM_WARN_ON(max_nr < 1); >>> + VM_WARN_ON(non_swap_entry(entry)); >>> + >>> + while (ptep < end_ptep) { >>> + pte_t pte = ptep_get(ptep); >>> + >>> + if (pte_none(pte) || pte_present(pte)) >>> + break; >>> + >>> + entry = pte_to_swp_entry(pte); >>> + >>> + if (non_swap_entry(entry) || >>> + swp_type(entry) != expected_type || >>> + swp_offset(entry) != expected_offset) >>> + break; >>> + >>> + expected_offset++; >>> + ptep++; >>> + } >>> + >>> + return ptep - start_ptep; >>> +} >> >> Looks very clean :) >> >> I was wondering whether we could similarly construct the expected swp PTE and >> only check pte_same. >> >> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset)); >> >> ... or have a variant to increase only the swp offset for an existing pte. But >> non-trivial due to the arch-dependent format. >> >> But then, we'd fail on mismatch of other swp pte bits. > > Hmm, perhaps I have a misunderstanding regarding "swp pte bits"... > >> >> >> On swapin, when reusing this function (likely!), we'll might to make sure that >> the PTE bits match as well. >> >> See below regarding uffd-wp. 
>> >> >>> #endif /* CONFIG_MMU */ >>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio, >>> diff --git a/mm/madvise.c b/mm/madvise.c >>> index 1f77a51baaac..070bedb4996e 100644 >>> --- a/mm/madvise.c >>> +++ b/mm/madvise.c >>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>> long addr, >>> struct folio *folio; >>> int nr_swap = 0; >>> unsigned long next; >>> + int nr, max_nr; >>> next = pmd_addr_end(addr, end); >>> if (pmd_trans_huge(*pmd)) >>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>> long addr, >>> return 0; >>> flush_tlb_batched_pending(mm); >>> arch_enter_lazy_mmu_mode(); >>> - for (; addr != end; pte++, addr += PAGE_SIZE) { >>> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { >>> + nr = 1; >>> ptent = ptep_get(pte); >>> if (pte_none(ptent)) >>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>> long addr, >>> entry = pte_to_swp_entry(ptent); >>> if (!non_swap_entry(entry)) { >>> - nr_swap--; >>> - free_swap_and_cache(entry); >>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>> + max_nr = (end - addr) / PAGE_SIZE; >>> + nr = swap_pte_batch(pte, max_nr, entry); >>> + nr_swap -= nr; >>> + free_swap_and_cache_nr(entry, nr); >>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >>> } else if (is_hwpoison_entry(entry) || >>> is_poisoned_swp_entry(entry)) { >>> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 7dc6c3d9fa83..ef2968894718 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather >>> *tlb, >>> folio_remove_rmap_pte(folio, page, vma); >>> folio_put(folio); >>> } else if (!non_swap_entry(entry)) { >>> - /* Genuine swap entry, hence a private anon page */ >>> + max_nr = (end - addr) / PAGE_SIZE; >>> + nr = swap_pte_batch(pte, max_nr, entry); >>> + /* Genuine swap entries, hence a private anon pages */ >>> if (!should_zap_cows(details)) >>> continue; >>> - rss[MM_SWAPENTS]--; >>> - if (unlikely(!free_swap_and_cache(entry))) >>> - print_bad_pte(vma, addr, ptent, NULL); >>> + rss[MM_SWAPENTS] -= nr; >>> + free_swap_and_cache_nr(entry, nr); >>> } else if (is_migration_entry(entry)) { >>> folio = pfn_swap_entry_folio(entry); >>> if (!should_zap_folio(details, folio)) >>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, >>> pr_alert("unrecognized swap entry 0x%lx\n", entry.val); >>> WARN_ON_ONCE(1); >>> } >>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); >>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >> >> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match. >> >> zap_install_uffd_wp_if_needed() will use the uffd-wp information in >> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers. >> >> On mixture, you either lose some or place too many markers. > > What path are you concerned about here? I don't get how what you describe can > happen? swap_pte_batch() will only give me a batch of actual swap entries and > actual swap entries don't contain uffd-wp info, IIUC. If the function gets to a > "non-swap" swap entry, it bails. I thought the uffd-wp info was populated based > on the VMA state at swap-in? I think you are telling me that it's persisted > across the swap per-pte? 
Please see zap_install_uffd_wp_if_needed(): if (unlikely(pte_swp_uffd_wp_any(pteval))) arm_uffd_pte = true; The PTEs (swp PTEs to be precise) contain uffd-wp informtation. [...] >>> + /* >>> + * Short-circuit the below loop if none of the entries had their >>> + * reference drop to zero. >>> + */ >>> + if (!any_only_cache) >>> + goto out; >>> - count = __swap_entry_free(p, entry); >>> - if (count == SWAP_HAS_CACHE) >>> - __try_to_reclaim_swap(p, swp_offset(entry), >>> + /* >>> + * Now go back over the range trying to reclaim the swap cache. This is >>> + * more efficient for large folios because we will only try to reclaim >>> + * the swap once per folio in the common case. If we do >>> + * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the >>> + * latter will get a reference and lock the folio for every individual >>> + * page but will only succeed once the swap slot for every subpage is >>> + * zero. >>> + */ >>> + for (offset = swp_offset(entry); offset < end; offset += nr) { >>> + nr = 1; >>> + if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { >> >> Here we use READ_ONCE() only, above data_race(). Hmmm. > > Yes. I think this is correct. > > READ_ONCE() is a "marked access" which KCSAN understands, so it won't complain > about it. So data_race() isn't required when READ_ONCE() (or WRITE_ONCE()) is > used. I believe READ_ONCE() is required here because we don't have a lock and we > want to make sure we read it in a non-tearing manner. > > We don't need the READ_ONCE() above since we don't care about the exact value - > only that it's not 0 (because we should be holding a ref). So do a plain access > to give the compiler a bit more freedom. But we need to mark that with > data_race() to stop KCSAN from complaining. Okay.
On 08/04/2024 10:43, David Hildenbrand wrote: > >>>> + >>>> +/** >>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries >>>> + * @start_ptep: Page table pointer for the first entry. >>>> + * @max_nr: The maximum number of table entries to consider. >>>> + * @entry: Swap entry recovered from the first table entry. >>>> + * >>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs >>>> + * containing swap entries all with consecutive offsets and targeting the same >>>> + * swap type. >>>> + * >>> >>> Likely you should document that any swp pte bits are ignored? () >> >> Sorry I don't understand this comment. I thought any non-none, non-present PTE >> was always considered to contain only a "swap entry" and a swap entry consists >> of a "type" and an "offset" only. (and its a special "non-swap" swap entry if >> type > SOME_CONSTANT) Are you saying there are additional fields in the PTE that >> are not part of the swap entry? > > > pte_swp_soft_dirty() > pte_swp_clear_exclusive() > pte_swp_uffd_wp() > > Are PTE bits used for swp PTE. Ahh wow. mind blown. Looks like a massive hack... why not store them in the arch-independent swap entry, rather than have them squat independently in the PTE? OK, my implementation is buggy. I'll re-spin to fix this. > > There is also dirty/young for migration entries, but that's not of a concern > here, because we stop for non_swap_entry(). Looks like these are part of the offset field in the arch-independent swap entry - much cleaner ;-). > >> >> >>> >>>> + * max_nr must be at least one and must be limited by the caller so scanning >>>> + * cannot exceed a single page table. >>>> + * >>>> + * Return: the number of table entries in the batch. >>>> + */ >>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, >>>> + swp_entry_t entry) >>>> +{ >>>> + const pte_t *end_ptep = start_ptep + max_nr; >>>> + unsigned long expected_offset = swp_offset(entry) + 1; >>>> + unsigned int expected_type = swp_type(entry); >>>> + pte_t *ptep = start_ptep + 1; >>>> + >>>> + VM_WARN_ON(max_nr < 1); >>>> + VM_WARN_ON(non_swap_entry(entry)); >>>> + >>>> + while (ptep < end_ptep) { >>>> + pte_t pte = ptep_get(ptep); >>>> + >>>> + if (pte_none(pte) || pte_present(pte)) >>>> + break; >>>> + >>>> + entry = pte_to_swp_entry(pte); >>>> + >>>> + if (non_swap_entry(entry) || >>>> + swp_type(entry) != expected_type || >>>> + swp_offset(entry) != expected_offset) >>>> + break; >>>> + >>>> + expected_offset++; >>>> + ptep++; >>>> + } >>>> + >>>> + return ptep - start_ptep; >>>> +} >>> >>> Looks very clean :) >>> >>> I was wondering whether we could similarly construct the expected swp PTE and >>> only check pte_same. >>> >>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset)); >>> >>> ... or have a variant to increase only the swp offset for an existing pte. But >>> non-trivial due to the arch-dependent format. >>> >>> But then, we'd fail on mismatch of other swp pte bits. >> >> Hmm, perhaps I have a misunderstanding regarding "swp pte bits"... >> >>> >>> >>> On swapin, when reusing this function (likely!), we'll might to make sure that >>> the PTE bits match as well. >>> >>> See below regarding uffd-wp. 
>>> >>> >>>> #endif /* CONFIG_MMU */ >>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio, >>>> diff --git a/mm/madvise.c b/mm/madvise.c >>>> index 1f77a51baaac..070bedb4996e 100644 >>>> --- a/mm/madvise.c >>>> +++ b/mm/madvise.c >>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>>> long addr, >>>> struct folio *folio; >>>> int nr_swap = 0; >>>> unsigned long next; >>>> + int nr, max_nr; >>>> next = pmd_addr_end(addr, end); >>>> if (pmd_trans_huge(*pmd)) >>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>>> long addr, >>>> return 0; >>>> flush_tlb_batched_pending(mm); >>>> arch_enter_lazy_mmu_mode(); >>>> - for (; addr != end; pte++, addr += PAGE_SIZE) { >>>> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { >>>> + nr = 1; >>>> ptent = ptep_get(pte); >>>> if (pte_none(ptent)) >>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>>> long addr, >>>> entry = pte_to_swp_entry(ptent); >>>> if (!non_swap_entry(entry)) { >>>> - nr_swap--; >>>> - free_swap_and_cache(entry); >>>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>>> + max_nr = (end - addr) / PAGE_SIZE; >>>> + nr = swap_pte_batch(pte, max_nr, entry); >>>> + nr_swap -= nr; >>>> + free_swap_and_cache_nr(entry, nr); >>>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >>>> } else if (is_hwpoison_entry(entry) || >>>> is_poisoned_swp_entry(entry)) { >>>> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 7dc6c3d9fa83..ef2968894718 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather >>>> *tlb, >>>> folio_remove_rmap_pte(folio, page, vma); >>>> folio_put(folio); >>>> } else if (!non_swap_entry(entry)) { >>>> - /* Genuine swap entry, hence a private anon page */ >>>> + max_nr = (end - addr) / PAGE_SIZE; >>>> + nr = swap_pte_batch(pte, max_nr, entry); >>>> + /* Genuine swap entries, hence a private anon pages */ >>>> if (!should_zap_cows(details)) >>>> continue; >>>> - rss[MM_SWAPENTS]--; >>>> - if (unlikely(!free_swap_and_cache(entry))) >>>> - print_bad_pte(vma, addr, ptent, NULL); >>>> + rss[MM_SWAPENTS] -= nr; >>>> + free_swap_and_cache_nr(entry, nr); >>>> } else if (is_migration_entry(entry)) { >>>> folio = pfn_swap_entry_folio(entry); >>>> if (!should_zap_folio(details, folio)) >>>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather >>>> *tlb, >>>> pr_alert("unrecognized swap entry 0x%lx\n", entry.val); >>>> WARN_ON_ONCE(1); >>>> } >>>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>>> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); >>>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >>> >>> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match. >>> >>> zap_install_uffd_wp_if_needed() will use the uffd-wp information in >>> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers. >>> >>> On mixture, you either lose some or place too many markers. >> >> What path are you concerned about here? I don't get how what you describe can >> happen? swap_pte_batch() will only give me a batch of actual swap entries and >> actual swap entries don't contain uffd-wp info, IIUC. If the function gets to a >> "non-swap" swap entry, it bails. I thought the uffd-wp info was populated based >> on the VMA state at swap-in? 
I think you are telling me that it's persisted >> across the swap per-pte? > > Please see zap_install_uffd_wp_if_needed(): > > if (unlikely(pte_swp_uffd_wp_any(pteval))) > arm_uffd_pte = true; > > The PTEs (swp PTEs to be precise) contain uffd-wp informtation. > > [...] > >>>> + /* >>>> + * Short-circuit the below loop if none of the entries had their >>>> + * reference drop to zero. >>>> + */ >>>> + if (!any_only_cache) >>>> + goto out; >>>> - count = __swap_entry_free(p, entry); >>>> - if (count == SWAP_HAS_CACHE) >>>> - __try_to_reclaim_swap(p, swp_offset(entry), >>>> + /* >>>> + * Now go back over the range trying to reclaim the swap cache. This is >>>> + * more efficient for large folios because we will only try to reclaim >>>> + * the swap once per folio in the common case. If we do >>>> + * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the >>>> + * latter will get a reference and lock the folio for every individual >>>> + * page but will only succeed once the swap slot for every subpage is >>>> + * zero. >>>> + */ >>>> + for (offset = swp_offset(entry); offset < end; offset += nr) { >>>> + nr = 1; >>>> + if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { >>> >>> Here we use READ_ONCE() only, above data_race(). Hmmm. >> >> Yes. I think this is correct. >> >> READ_ONCE() is a "marked access" which KCSAN understands, so it won't complain >> about it. So data_race() isn't required when READ_ONCE() (or WRITE_ONCE()) is >> used. I believe READ_ONCE() is required here because we don't have a lock and we >> want to make sure we read it in a non-tearing manner. >> >> We don't need the READ_ONCE() above since we don't care about the exact value - >> only that it's not 0 (because we should be holding a ref). So do a plain access >> to give the compiler a bit more freedom. But we need to mark that with >> data_race() to stop KCSAN from complaining. > > Okay. >
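To summarise the convention being applied in that loop: READ_ONCE() is the marked, non-tearing load for when the exact value matters, while a plain load wrapped in data_race() documents an intentional, value-insensitive race for KCSAN. A rough kernel-context sketch (the helper names here are invented for illustration):

#include <linux/compiler.h>	/* READ_ONCE(), data_race() */
#include <linux/types.h>

/* Exact value matters and no lock is held: use a marked, non-tearing load. */
static unsigned char read_exact(unsigned char *slot)
{
	return READ_ONCE(*slot);
}

/*
 * Only zero vs non-zero matters (a reference is held, so it cannot drop to
 * zero underneath us): a plain load is fine, but annotate it with
 * data_race() so KCSAN knows the race is intentional.
 */
static bool read_nonzero(unsigned char *slot)
{
	return data_race(*slot) != 0;
}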
On 08/04/2024 11:24, David Hildenbrand wrote: > On 08.04.24 12:07, Ryan Roberts wrote: >> On 08/04/2024 10:43, David Hildenbrand wrote: >>> >>>>>> + >>>>>> +/** >>>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries >>>>>> + * @start_ptep: Page table pointer for the first entry. >>>>>> + * @max_nr: The maximum number of table entries to consider. >>>>>> + * @entry: Swap entry recovered from the first table entry. >>>>>> + * >>>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs >>>>>> + * containing swap entries all with consecutive offsets and targeting the >>>>>> same >>>>>> + * swap type. >>>>>> + * >>>>> >>>>> Likely you should document that any swp pte bits are ignored? () >>>> >>>> Sorry I don't understand this comment. I thought any non-none, non-present PTE >>>> was always considered to contain only a "swap entry" and a swap entry consists >>>> of a "type" and an "offset" only. (and its a special "non-swap" swap entry if >>>> type > SOME_CONSTANT) Are you saying there are additional fields in the PTE >>>> that >>>> are not part of the swap entry? >>> >>> >>> pte_swp_soft_dirty() >>> pte_swp_clear_exclusive() >>> pte_swp_uffd_wp() >>> >>> Are PTE bits used for swp PTE. >> >> Ahh wow. mind blown. Looks like a massive hack... why not store them in the >> arch-independent swap entry, rather than have them squat independently in the >> PTE? > > I think that was discussed at some point, but it not only requires quite some > churn to change it (all that swp entry code is a mess), these bits are > conceptually really per PTE and not something you would want to pass into actual > swap handling code that couldn't care less about all of these. > > I looked at this when I added SWP exclusive, but accidentally losing the SWP > exclusive marker when converting back and forth made me go the PTE route instead. > > Then, the available PTE bits are a bit scattered on some architectures, and > converting entry<->PTE gets even uglier if we don't want to "lose" available bits. > > Probably the whole "unsigned long swp_entry" stuff should be replaced by a > proper struct where we could more easily add flags and have the arch code handle > the conversion to the PTE just once. The arch-specific swp_entry stuff is > another nasty thing IMHO. Yep understood. I'll file this under "there be dragons". Thanks for the explanation. > >> >> OK, my implementation is buggy. I'll re-spin to fix this. >> >> >>> >>> There is also dirty/young for migration entries, but that's not of a concern >>> here, because we stop for non_swap_entry(). >> >> Looks like these are part of the offset field in the arch-independent swap entry >> - much cleaner ;-). > > Note that it only applies to migration entries, and only when we have some spare > bits due to PFN < offset. Yep got it. Thanks!
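For concreteness, the split can be sketched as below (kernel-context; the function name is invented, and which pte_swp_mk*() helpers actually do anything depends on the architecture and config):

#include <linux/pgtable.h>
#include <linux/swapops.h>

/*
 * type/offset live in the arch-independent swp_entry_t; soft-dirty,
 * exclusive and uffd-wp travel as extra bits in the swp PTE itself and
 * are not visible through pte_to_swp_entry() alone.
 */
static pte_t example_swp_pte(unsigned int type, unsigned long offset)
{
	swp_entry_t entry = swp_entry(type, offset);	/* arch-independent part */
	pte_t pte = swp_entry_to_pte(entry);		/* arch-specific encoding */

	pte = pte_swp_mksoft_dirty(pte);	/* per-PTE bit, not in the entry */
	pte = pte_swp_mkuffd_wp(pte);		/* per-PTE bit, not in the entry */

	return pte;
}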
[...] > > [...] > >> + >> +/** >> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries >> + * @start_ptep: Page table pointer for the first entry. >> + * @max_nr: The maximum number of table entries to consider. >> + * @entry: Swap entry recovered from the first table entry. >> + * >> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs >> + * containing swap entries all with consecutive offsets and targeting the same >> + * swap type. >> + * > > Likely you should document that any swp pte bits are ignored? () Now that I understand what swp pte bits are, I think the simplest thing is to just make this function always consider the pte bits by using pte_same() as you suggest below? I don't think there is ever a case for ignoring the swp pte bits? And then I don't need to do anything special for uffd-wp either (below you suggested not doing batching when the VMA has uffd enabled). Any concerns? > >> + * max_nr must be at least one and must be limited by the caller so scanning >> + * cannot exceed a single page table. >> + * >> + * Return: the number of table entries in the batch. >> + */ >> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, >> + swp_entry_t entry) >> +{ >> + const pte_t *end_ptep = start_ptep + max_nr; >> + unsigned long expected_offset = swp_offset(entry) + 1; >> + unsigned int expected_type = swp_type(entry); >> + pte_t *ptep = start_ptep + 1; >> + >> + VM_WARN_ON(max_nr < 1); >> + VM_WARN_ON(non_swap_entry(entry)); >> + >> + while (ptep < end_ptep) { >> + pte_t pte = ptep_get(ptep); >> + >> + if (pte_none(pte) || pte_present(pte)) >> + break; >> + >> + entry = pte_to_swp_entry(pte); >> + >> + if (non_swap_entry(entry) || >> + swp_type(entry) != expected_type || >> + swp_offset(entry) != expected_offset) >> + break; >> + >> + expected_offset++; >> + ptep++; >> + } >> + >> + return ptep - start_ptep; >> +} > > Looks very clean :) > > I was wondering whether we could similarly construct the expected swp PTE and > only check pte_same. > > expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset)); So planning to do this. > > ... or have a variant to increase only the swp offset for an existing pte. But > non-trivial due to the arch-dependent format. not this - I agree this will be difficult due to per-arch changes. I'd rather just do the generic version and leave the compiler to do the best it can to simplify and optimize. > > But then, we'd fail on mismatch of other swp pte bits. > > > On swapin, when reusing this function (likely!), we'll might to make sure that > the PTE bits match as well. > > See below regarding uffd-wp. 
> > >> #endif /* CONFIG_MMU */ >> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio, >> diff --git a/mm/madvise.c b/mm/madvise.c >> index 1f77a51baaac..070bedb4996e 100644 >> --- a/mm/madvise.c >> +++ b/mm/madvise.c >> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >> long addr, >> struct folio *folio; >> int nr_swap = 0; >> unsigned long next; >> + int nr, max_nr; >> next = pmd_addr_end(addr, end); >> if (pmd_trans_huge(*pmd)) >> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >> long addr, >> return 0; >> flush_tlb_batched_pending(mm); >> arch_enter_lazy_mmu_mode(); >> - for (; addr != end; pte++, addr += PAGE_SIZE) { >> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { >> + nr = 1; >> ptent = ptep_get(pte); >> if (pte_none(ptent)) >> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >> long addr, >> entry = pte_to_swp_entry(ptent); >> if (!non_swap_entry(entry)) { >> - nr_swap--; >> - free_swap_and_cache(entry); >> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >> + max_nr = (end - addr) / PAGE_SIZE; >> + nr = swap_pte_batch(pte, max_nr, entry); >> + nr_swap -= nr; >> + free_swap_and_cache_nr(entry, nr); >> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >> } else if (is_hwpoison_entry(entry) || >> is_poisoned_swp_entry(entry)) { >> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >> diff --git a/mm/memory.c b/mm/memory.c >> index 7dc6c3d9fa83..ef2968894718 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather >> *tlb, >> folio_remove_rmap_pte(folio, page, vma); >> folio_put(folio); >> } else if (!non_swap_entry(entry)) { >> - /* Genuine swap entry, hence a private anon page */ >> + max_nr = (end - addr) / PAGE_SIZE; >> + nr = swap_pte_batch(pte, max_nr, entry); >> + /* Genuine swap entries, hence a private anon pages */ >> if (!should_zap_cows(details)) >> continue; >> - rss[MM_SWAPENTS]--; >> - if (unlikely(!free_swap_and_cache(entry))) >> - print_bad_pte(vma, addr, ptent, NULL); >> + rss[MM_SWAPENTS] -= nr; >> + free_swap_and_cache_nr(entry, nr); >> } else if (is_migration_entry(entry)) { >> folio = pfn_swap_entry_folio(entry); >> if (!should_zap_folio(details, folio)) >> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, >> pr_alert("unrecognized swap entry 0x%lx\n", entry.val); >> WARN_ON_ONCE(1); >> } >> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); >> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); > > For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match. > > zap_install_uffd_wp_if_needed() will use the uffd-wp information in > ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers. > > On mixture, you either lose some or place too many markers. > > A simple workaround would be to disable any such batching if the VMA does have > uffd-wp enabled. Rather than this, I'll just consider all the swp pte bits when batching. > >> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); >> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); [...]
On 08/04/2024 13:07, Ryan Roberts wrote: > [...] >> >> [...] >> >>> + >>> +/** >>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries >>> + * @start_ptep: Page table pointer for the first entry. >>> + * @max_nr: The maximum number of table entries to consider. >>> + * @entry: Swap entry recovered from the first table entry. >>> + * >>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs >>> + * containing swap entries all with consecutive offsets and targeting the same >>> + * swap type. >>> + * >> >> Likely you should document that any swp pte bits are ignored? () > > Now that I understand what swp pte bits are, I think the simplest thing is to > just make this function always consider the pte bits by using pte_same() as you > suggest below? I don't think there is ever a case for ignoring the swp pte bits? > And then I don't need to do anything special for uffd-wp either (below you > suggested not doing batching when the VMA has uffd enabled). > > Any concerns? > >> >>> + * max_nr must be at least one and must be limited by the caller so scanning >>> + * cannot exceed a single page table. >>> + * >>> + * Return: the number of table entries in the batch. >>> + */ >>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, >>> + swp_entry_t entry) >>> +{ >>> + const pte_t *end_ptep = start_ptep + max_nr; >>> + unsigned long expected_offset = swp_offset(entry) + 1; >>> + unsigned int expected_type = swp_type(entry); >>> + pte_t *ptep = start_ptep + 1; >>> + >>> + VM_WARN_ON(max_nr < 1); >>> + VM_WARN_ON(non_swap_entry(entry)); >>> + >>> + while (ptep < end_ptep) { >>> + pte_t pte = ptep_get(ptep); >>> + >>> + if (pte_none(pte) || pte_present(pte)) >>> + break; >>> + >>> + entry = pte_to_swp_entry(pte); >>> + >>> + if (non_swap_entry(entry) || >>> + swp_type(entry) != expected_type || >>> + swp_offset(entry) != expected_offset) >>> + break; >>> + >>> + expected_offset++; >>> + ptep++; >>> + } >>> + >>> + return ptep - start_ptep; >>> +} >> >> Looks very clean :) >> >> I was wondering whether we could similarly construct the expected swp PTE and >> only check pte_same. >> >> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset)); > > So planning to do this. Of course this clears all the swp pte bits in expected_pte. So need to do something a bit more complex. 
If we can safely assume all offset bits are contiguous in every per-arch representation then we can do:

static inline pte_t pte_next_swp_offset(pte_t pte)
{
	pte_t offset_inc = __swp_entry_to_pte(__swp_entry(0, 1));

	return __pte(pte_val(pte) + pte_val(offset_inc));
}

Or if not:

static inline pte_t pte_next_swp_offset(pte_t pte)
{
	swp_entry_t entry = pte_to_swp_entry(pte);
	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
						   swp_offset(entry) + 1));

	if (pte_swp_soft_dirty(pte))
		new = pte_swp_mksoft_dirty(new);
	if (pte_swp_exclusive(pte))
		new = pte_swp_mkexclusive(new);
	if (pte_swp_uffd_wp(pte))
		new = pte_swp_mkuffd_wp(new);

	return new;
}

Then swap_pte_batch() becomes:

static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
{
	pte_t expected_pte = pte_next_swp_offset(pte);
	const pte_t *end_ptep = start_ptep + max_nr;
	pte_t *ptep = start_ptep + 1;

	VM_WARN_ON(max_nr < 1);
	VM_WARN_ON(!is_swap_pte(pte));
	VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));

	while (ptep < end_ptep) {
		pte = ptep_get(ptep);

		if (!pte_same(pte, expected_pte))
			break;

		expected_pte = pte_next_swp_offset(expected_pte);
		ptep++;
	}

	return ptep - start_ptep;
}

Would you be happy with either of these? I'll go look if we can assume the offset bits are always contiguous.

>
>>
>> ... or have a variant to increase only the swp offset for an existing pte. But
>> non-trivial due to the arch-dependent format.
>
> not this - I agree this will be difficult due to per-arch changes. I'd rather
> just do the generic version and leave the compiler to do the best it can to
> simplify and optimize.
>
>>
>> But then, we'd fail on mismatch of other swp pte bits.
>>
>>
>> On swapin, when reusing this function (likely!), we'll might to make sure that
>> the PTE bits match as well.
>>
>> See below regarding uffd-wp.
>> >> >>> #endif /* CONFIG_MMU */ >>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio, >>> diff --git a/mm/madvise.c b/mm/madvise.c >>> index 1f77a51baaac..070bedb4996e 100644 >>> --- a/mm/madvise.c >>> +++ b/mm/madvise.c >>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>> long addr, >>> struct folio *folio; >>> int nr_swap = 0; >>> unsigned long next; >>> + int nr, max_nr; >>> next = pmd_addr_end(addr, end); >>> if (pmd_trans_huge(*pmd)) >>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>> long addr, >>> return 0; >>> flush_tlb_batched_pending(mm); >>> arch_enter_lazy_mmu_mode(); >>> - for (; addr != end; pte++, addr += PAGE_SIZE) { >>> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { >>> + nr = 1; >>> ptent = ptep_get(pte); >>> if (pte_none(ptent)) >>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>> long addr, >>> entry = pte_to_swp_entry(ptent); >>> if (!non_swap_entry(entry)) { >>> - nr_swap--; >>> - free_swap_and_cache(entry); >>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>> + max_nr = (end - addr) / PAGE_SIZE; >>> + nr = swap_pte_batch(pte, max_nr, entry); >>> + nr_swap -= nr; >>> + free_swap_and_cache_nr(entry, nr); >>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >>> } else if (is_hwpoison_entry(entry) || >>> is_poisoned_swp_entry(entry)) { >>> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 7dc6c3d9fa83..ef2968894718 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather >>> *tlb, >>> folio_remove_rmap_pte(folio, page, vma); >>> folio_put(folio); >>> } else if (!non_swap_entry(entry)) { >>> - /* Genuine swap entry, hence a private anon page */ >>> + max_nr = (end - addr) / PAGE_SIZE; >>> + nr = swap_pte_batch(pte, max_nr, entry); >>> + /* Genuine swap entries, hence a private anon pages */ >>> if (!should_zap_cows(details)) >>> continue; >>> - rss[MM_SWAPENTS]--; >>> - if (unlikely(!free_swap_and_cache(entry))) >>> - print_bad_pte(vma, addr, ptent, NULL); >>> + rss[MM_SWAPENTS] -= nr; >>> + free_swap_and_cache_nr(entry, nr); >>> } else if (is_migration_entry(entry)) { >>> folio = pfn_swap_entry_folio(entry); >>> if (!should_zap_folio(details, folio)) >>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, >>> pr_alert("unrecognized swap entry 0x%lx\n", entry.val); >>> WARN_ON_ONCE(1); >>> } >>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); >>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >> >> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match. >> >> zap_install_uffd_wp_if_needed() will use the uffd-wp information in >> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers. >> >> On mixture, you either lose some or place too many markers. >> >> A simple workaround would be to disable any such batching if the VMA does have >> uffd-wp enabled. > > Rather than this, I'll just consider all the swp pte bits when batching. > >> >>> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); >>> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); > > [...] >
On 08/04/2024 13:47, Ryan Roberts wrote: > On 08/04/2024 13:07, Ryan Roberts wrote: >> [...] >>> >>> [...] >>> >>>> + >>>> +/** >>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries >>>> + * @start_ptep: Page table pointer for the first entry. >>>> + * @max_nr: The maximum number of table entries to consider. >>>> + * @entry: Swap entry recovered from the first table entry. >>>> + * >>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs >>>> + * containing swap entries all with consecutive offsets and targeting the same >>>> + * swap type. >>>> + * >>> >>> Likely you should document that any swp pte bits are ignored? () >> >> Now that I understand what swp pte bits are, I think the simplest thing is to >> just make this function always consider the pte bits by using pte_same() as you >> suggest below? I don't think there is ever a case for ignoring the swp pte bits? >> And then I don't need to do anything special for uffd-wp either (below you >> suggested not doing batching when the VMA has uffd enabled). >> >> Any concerns? >> >>> >>>> + * max_nr must be at least one and must be limited by the caller so scanning >>>> + * cannot exceed a single page table. >>>> + * >>>> + * Return: the number of table entries in the batch. >>>> + */ >>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, >>>> + swp_entry_t entry) >>>> +{ >>>> + const pte_t *end_ptep = start_ptep + max_nr; >>>> + unsigned long expected_offset = swp_offset(entry) + 1; >>>> + unsigned int expected_type = swp_type(entry); >>>> + pte_t *ptep = start_ptep + 1; >>>> + >>>> + VM_WARN_ON(max_nr < 1); >>>> + VM_WARN_ON(non_swap_entry(entry)); >>>> + >>>> + while (ptep < end_ptep) { >>>> + pte_t pte = ptep_get(ptep); >>>> + >>>> + if (pte_none(pte) || pte_present(pte)) >>>> + break; >>>> + >>>> + entry = pte_to_swp_entry(pte); >>>> + >>>> + if (non_swap_entry(entry) || >>>> + swp_type(entry) != expected_type || >>>> + swp_offset(entry) != expected_offset) >>>> + break; >>>> + >>>> + expected_offset++; >>>> + ptep++; >>>> + } >>>> + >>>> + return ptep - start_ptep; >>>> +} >>> >>> Looks very clean :) >>> >>> I was wondering whether we could similarly construct the expected swp PTE and >>> only check pte_same. >>> >>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset)); >> >> So planning to do this. > > Of course this clears all the swp pte bits in expected_pte. So need to do something a bit more complex. > > If we can safely assume all offset bits are contiguous in every per-arch representation then we can do: Looks like at least csky and hexagon store the offset in discontiguous regions. So it will have to be the second approach if we want to avoid anything arch-specific. I'll assume that for now; we can always specialize pte_next_swp_offset() per-arch in the future if needed. 
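To see why a simple "add a constant to the pte" increment is unsafe with a split offset field, here is a standalone sketch using a made-up layout (not any architecture's real format):

#include <stdio.h>

/* Offset bits 0..2 live in pte bits 2..4; the remaining offset bits start at bit 8. */
static unsigned long encode_offset(unsigned long off)
{
	return ((off & 0x7) << 2) | ((off >> 3) << 8);
}

int main(void)
{
	unsigned long pte = encode_offset(7);	/* low offset field is all-ones */
	unsigned long inc = encode_offset(1);	/* the "offset_inc" constant */
	unsigned long naive = pte + inc;	/* carry escapes into a non-offset bit */

	/* Prints encode(8)=0x100 vs pte+inc=0x20: plain addition gets it wrong. */
	printf("encode(8)=%#lx, pte+inc=%#lx\n", encode_offset(8), naive);
	return 0;
}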
> > static inline pte_t pte_next_swp_offset(pte_t pte) > { > pte_t offset_inc = __swp_entry_to_pte(__swp_entry(0, 1)); > > return __pte(pte_val(pte) + pte_val(offset_inc)); > } > > Or if not: > > static inline pte_t pte_next_swp_offset(pte_t pte) > { > swp_entry_t entry = pte_to_swp_entry(pte); > pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry), swp_offset(entry) + 1)); > > if (pte_swp_soft_dirty(pte)) > new = pte_swp_mksoft_dirty(new); > if (pte_swp_exclusive(pte)) > new = pte_swp_mkexclusive(new); > if (pte_swp_uffd_wp(pte)) > new = pte_swp_mkuffd_wp(new); > > return new; > } > > Then swap_pte_batch() becomes: > > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) > { > pte_t expected_pte = pte_next_swp_offset(pte); > const pte_t *end_ptep = start_ptep + max_nr; > pte_t *ptep = start_ptep + 1; > > VM_WARN_ON(max_nr < 1); > VM_WARN_ON(!is_swap_pte(pte)); > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte))); > > while (ptep < end_ptep) { > pte = ptep_get(ptep); > > if (!pte_same(pte, expected_pte)) > break; > > expected_pte = pte_next_swp_offset(expected_pte); > ptep++; > } > > return ptep - start_ptep; > } > > Would you be happy with either of these? I'll go look if we can assume the offset bits are always contiguous. > > >> >>> >>> ... or have a variant to increase only the swp offset for an existing pte. But >>> non-trivial due to the arch-dependent format. >> >> not this - I agree this will be difficult due to per-arch changes. I'd rather >> just do the generic version and leave the compiler to do the best it can to >> simplify and optimize. >> >>> >>> But then, we'd fail on mismatch of other swp pte bits. >>> >>> >>> On swapin, when reusing this function (likely!), we'll might to make sure that >>> the PTE bits match as well. >>> >>> See below regarding uffd-wp. 
>>> >>> >>>> #endif /* CONFIG_MMU */ >>>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio, >>>> diff --git a/mm/madvise.c b/mm/madvise.c >>>> index 1f77a51baaac..070bedb4996e 100644 >>>> --- a/mm/madvise.c >>>> +++ b/mm/madvise.c >>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>>> long addr, >>>> struct folio *folio; >>>> int nr_swap = 0; >>>> unsigned long next; >>>> + int nr, max_nr; >>>> next = pmd_addr_end(addr, end); >>>> if (pmd_trans_huge(*pmd)) >>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>>> long addr, >>>> return 0; >>>> flush_tlb_batched_pending(mm); >>>> arch_enter_lazy_mmu_mode(); >>>> - for (; addr != end; pte++, addr += PAGE_SIZE) { >>>> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { >>>> + nr = 1; >>>> ptent = ptep_get(pte); >>>> if (pte_none(ptent)) >>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned >>>> long addr, >>>> entry = pte_to_swp_entry(ptent); >>>> if (!non_swap_entry(entry)) { >>>> - nr_swap--; >>>> - free_swap_and_cache(entry); >>>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>>> + max_nr = (end - addr) / PAGE_SIZE; >>>> + nr = swap_pte_batch(pte, max_nr, entry); >>>> + nr_swap -= nr; >>>> + free_swap_and_cache_nr(entry, nr); >>>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >>>> } else if (is_hwpoison_entry(entry) || >>>> is_poisoned_swp_entry(entry)) { >>>> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 7dc6c3d9fa83..ef2968894718 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather >>>> *tlb, >>>> folio_remove_rmap_pte(folio, page, vma); >>>> folio_put(folio); >>>> } else if (!non_swap_entry(entry)) { >>>> - /* Genuine swap entry, hence a private anon page */ >>>> + max_nr = (end - addr) / PAGE_SIZE; >>>> + nr = swap_pte_batch(pte, max_nr, entry); >>>> + /* Genuine swap entries, hence a private anon pages */ >>>> if (!should_zap_cows(details)) >>>> continue; >>>> - rss[MM_SWAPENTS]--; >>>> - if (unlikely(!free_swap_and_cache(entry))) >>>> - print_bad_pte(vma, addr, ptent, NULL); >>>> + rss[MM_SWAPENTS] -= nr; >>>> + free_swap_and_cache_nr(entry, nr); >>>> } else if (is_migration_entry(entry)) { >>>> folio = pfn_swap_entry_folio(entry); >>>> if (!should_zap_folio(details, folio)) >>>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, >>>> pr_alert("unrecognized swap entry 0x%lx\n", entry.val); >>>> WARN_ON_ONCE(1); >>>> } >>>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); >>>> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); >>>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); >>> >>> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match. >>> >>> zap_install_uffd_wp_if_needed() will use the uffd-wp information in >>> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers. >>> >>> On mixture, you either lose some or place too many markers. >>> >>> A simple workaround would be to disable any such batching if the VMA does have >>> uffd-wp enabled. >> >> Rather than this, I'll just consider all the swp pte bits when batching. >> >>> >>>> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); >>>> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); >> >> [...] >> >
On 08.04.24 15:27, Ryan Roberts wrote: > On 08/04/2024 13:47, Ryan Roberts wrote: >> On 08/04/2024 13:07, Ryan Roberts wrote: >>> [...] >>>> >>>> [...] >>>> >>>>> + >>>>> +/** >>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries >>>>> + * @start_ptep: Page table pointer for the first entry. >>>>> + * @max_nr: The maximum number of table entries to consider. >>>>> + * @entry: Swap entry recovered from the first table entry. >>>>> + * >>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs >>>>> + * containing swap entries all with consecutive offsets and targeting the same >>>>> + * swap type. >>>>> + * >>>> >>>> Likely you should document that any swp pte bits are ignored? () >>> >>> Now that I understand what swp pte bits are, I think the simplest thing is to >>> just make this function always consider the pte bits by using pte_same() as you >>> suggest below? I don't think there is ever a case for ignoring the swp pte bits? >>> And then I don't need to do anything special for uffd-wp either (below you >>> suggested not doing batching when the VMA has uffd enabled). >>> >>> Any concerns? >>> >>>> >>>>> + * max_nr must be at least one and must be limited by the caller so scanning >>>>> + * cannot exceed a single page table. >>>>> + * >>>>> + * Return: the number of table entries in the batch. >>>>> + */ >>>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, >>>>> + swp_entry_t entry) >>>>> +{ >>>>> + const pte_t *end_ptep = start_ptep + max_nr; >>>>> + unsigned long expected_offset = swp_offset(entry) + 1; >>>>> + unsigned int expected_type = swp_type(entry); >>>>> + pte_t *ptep = start_ptep + 1; >>>>> + >>>>> + VM_WARN_ON(max_nr < 1); >>>>> + VM_WARN_ON(non_swap_entry(entry)); >>>>> + >>>>> + while (ptep < end_ptep) { >>>>> + pte_t pte = ptep_get(ptep); >>>>> + >>>>> + if (pte_none(pte) || pte_present(pte)) >>>>> + break; >>>>> + >>>>> + entry = pte_to_swp_entry(pte); >>>>> + >>>>> + if (non_swap_entry(entry) || >>>>> + swp_type(entry) != expected_type || >>>>> + swp_offset(entry) != expected_offset) >>>>> + break; >>>>> + >>>>> + expected_offset++; >>>>> + ptep++; >>>>> + } >>>>> + >>>>> + return ptep - start_ptep; >>>>> +} >>>> >>>> Looks very clean :) >>>> >>>> I was wondering whether we could similarly construct the expected swp PTE and >>>> only check pte_same. >>>> >>>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset)); >>> >>> So planning to do this. >> >> Of course this clears all the swp pte bits in expected_pte. So need to do something a bit more complex. >> >> If we can safely assume all offset bits are contiguous in every per-arch representation then we can do: > > Looks like at least csky and hexagon store the offset in discontiguous regions. > So it will have to be the second approach if we want to avoid anything > arch-specific. I'll assume that for now; we can always specialize > pte_next_swp_offset() per-arch in the future if needed. Sounds good. Just have a generic variant as you proposed, and add the per-arch one if really required later.
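As a sanity check on the csky/hexagon point, here is a small userspace sketch of why the single-addition variant (pte_val(pte) + pte_val(offset_inc)) cannot be used generically. The bit layout below is invented purely for illustration (a 5-bit low offset field, one flag bit, then the high offset bits) and is not the real csky or hexagon format; the point is only that a carry out of the low field lands in whichever bit happens to sit above it:

#include <stdio.h>

/*
 * Invented swap PTE layout, for illustration only: offset[4:0] in bits 0-4,
 * one flag bit at bit 5, the remaining offset bits from bit 6 upwards.
 */
#define LOW_BITS	5
#define LOW_MASK	((1ul << LOW_BITS) - 1)
#define FLAG_BIT	(1ul << LOW_BITS)

static unsigned long encode(unsigned long offset, int flag)
{
	return (offset & LOW_MASK) | (flag ? FLAG_BIT : 0) |
	       ((offset >> LOW_BITS) << (LOW_BITS + 1));
}

static unsigned long decode_offset(unsigned long pteval)
{
	return (pteval & LOW_MASK) | ((pteval >> (LOW_BITS + 1)) << LOW_BITS);
}

int main(void)
{
	/* Offset 31 saturates the low field (0b11111), flag clear. */
	unsigned long pte = encode(31, 0);
	/* The "just add an encoded offset of 1" trick. */
	unsigned long next = pte + encode(1, 0);

	/* The carry escapes into the flag bit instead of the high field. */
	printf("wanted offset 32, got offset %lu, flag is now %s\n",
	       decode_offset(next), (next & FLAG_BIT) ? "set" : "clear");
	return 0;
}

The bit-preserving variant above (decode, increment the offset, re-encode, then copy the soft-dirty/exclusive/uffd-wp bits back) sidesteps this entirely, at the cost of a little more work per PTE.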
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index a3fc8150b047..0278259f7078 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm, } #endif +#ifndef clear_not_present_full_ptes +/** + * clear_not_present_full_ptes - Clear consecutive not present PTEs. + * @mm: Address space the ptes represent. + * @addr: Address of the first pte. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries to clear. + * @full: Whether we are clearing a full mm. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over pte_clear_not_present_full(). + * + * Context: The caller holds the page table lock. The PTEs are all not present. + * The PTEs are all in the same PMD. + */ +static inline void clear_not_present_full_ptes(struct mm_struct *mm, + unsigned long addr, pte_t *ptep, unsigned int nr, int full) +{ + for (;;) { + pte_clear_not_present_full(mm, addr, ptep, full); + if (--nr == 0) + break; + ptep++; + addr += PAGE_SIZE; + } +} +#endif + #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH extern pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, diff --git a/include/linux/swap.h b/include/linux/swap.h index f6f78198f000..5737236dc3ce 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t); extern int swapcache_prepare(swp_entry_t); extern void swap_free(swp_entry_t); extern void swapcache_free_entries(swp_entry_t *entries, int n); -extern int free_swap_and_cache(swp_entry_t); +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr)); -/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */ -#define free_swap_and_cache(e) is_pfn_swap_entry(e) +static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr) +{ +} static inline void free_swap_cache(struct folio *folio) { @@ -589,6 +590,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis, } #endif /* CONFIG_SWAP */ +static inline void free_swap_and_cache(swp_entry_t entry) +{ + free_swap_and_cache_nr(entry, 1); +} + #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/mm/internal.h b/mm/internal.h index d567381b12cc..88705ab4c50a 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -11,6 +11,8 @@ #include <linux/mm.h> #include <linux/pagemap.h> #include <linux/rmap.h> +#include <linux/swap.h> +#include <linux/swapops.h> #include <linux/tracepoint-defs.h> struct folio_batch; @@ -189,6 +191,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, return min(ptep - start_ptep, max_nr); } + +/** + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries + * @start_ptep: Page table pointer for the first entry. + * @max_nr: The maximum number of table entries to consider. + * @entry: Swap entry recovered from the first table entry. + * + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs + * containing swap entries all with consecutive offsets and targeting the same + * swap type. 
+ * + * max_nr must be at least one and must be limited by the caller so scanning + * cannot exceed a single page table. + * + * Return: the number of table entries in the batch. + */ +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, + swp_entry_t entry) +{ + const pte_t *end_ptep = start_ptep + max_nr; + unsigned long expected_offset = swp_offset(entry) + 1; + unsigned int expected_type = swp_type(entry); + pte_t *ptep = start_ptep + 1; + + VM_WARN_ON(max_nr < 1); + VM_WARN_ON(non_swap_entry(entry)); + + while (ptep < end_ptep) { + pte_t pte = ptep_get(ptep); + + if (pte_none(pte) || pte_present(pte)) + break; + + entry = pte_to_swp_entry(pte); + + if (non_swap_entry(entry) || + swp_type(entry) != expected_type || + swp_offset(entry) != expected_offset) + break; + + expected_offset++; + ptep++; + } + + return ptep - start_ptep; +} #endif /* CONFIG_MMU */ void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio, diff --git a/mm/madvise.c b/mm/madvise.c index 1f77a51baaac..070bedb4996e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, struct folio *folio; int nr_swap = 0; unsigned long next; + int nr, max_nr; next = pmd_addr_end(addr, end); if (pmd_trans_huge(*pmd)) @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, return 0; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); - for (; addr != end; pte++, addr += PAGE_SIZE) { + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { + nr = 1; ptent = ptep_get(pte); if (pte_none(ptent)) @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, entry = pte_to_swp_entry(ptent); if (!non_swap_entry(entry)) { - nr_swap--; - free_swap_and_cache(entry); - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + max_nr = (end - addr) / PAGE_SIZE; + nr = swap_pte_batch(pte, max_nr, entry); + nr_swap -= nr; + free_swap_and_cache_nr(entry, nr); + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); diff --git a/mm/memory.c b/mm/memory.c index 7dc6c3d9fa83..ef2968894718 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, folio_remove_rmap_pte(folio, page, vma); folio_put(folio); } else if (!non_swap_entry(entry)) { - /* Genuine swap entry, hence a private anon page */ + max_nr = (end - addr) / PAGE_SIZE; + nr = swap_pte_batch(pte, max_nr, entry); + /* Genuine swap entries, hence a private anon pages */ if (!should_zap_cows(details)) continue; - rss[MM_SWAPENTS]--; - if (unlikely(!free_swap_and_cache(entry))) - print_bad_pte(vma, addr, ptent, NULL); + rss[MM_SWAPENTS] -= nr; + free_swap_and_cache_nr(entry, nr); } else if (is_migration_entry(entry)) { folio = pfn_swap_entry_folio(entry); if (!should_zap_folio(details, folio)) @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, pr_alert("unrecognized swap entry 0x%lx\n", entry.val); WARN_ON_ONCE(1); } - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); add_mm_rss_vec(mm, rss); diff --git a/mm/swapfile.c b/mm/swapfile.c index 
0d44ee2b4f9c..d059de6896c1 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent) /* Reclaim the swap entry if swap is getting full*/ #define TTRS_FULL 0x4 -/* returns 1 if swap entry is freed */ +/* + * returns number of pages in the folio that backs the swap entry. If positive, + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no + * folio was associated with the swap entry. + */ static int __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset, unsigned long flags) { @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, ret = folio_free_swap(folio); folio_unlock(folio); } + ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio); folio_put(folio); return ret; } @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si, swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); spin_lock(&si->lock); /* entry was freed successfully, try to use this again */ - if (swap_was_freed) + if (swap_was_freed > 0) goto checks; goto scan; /* check next one */ } @@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio) return true; } -/* - * Free the swap entry like above, but also try to - * free the page cache entry if it is the last user. - */ -int free_swap_and_cache(swp_entry_t entry) +void free_swap_and_cache_nr(swp_entry_t entry, int nr) { - struct swap_info_struct *p; + unsigned long end = swp_offset(entry) + nr; + unsigned int type = swp_type(entry); + struct swap_info_struct *si; + bool any_only_cache = false; + unsigned long offset; unsigned char count; if (non_swap_entry(entry)) - return 1; + return; - p = get_swap_device(entry); - if (p) { - if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) { - put_swap_device(p); - return 0; + si = get_swap_device(entry); + if (!si) + return; + + if (WARN_ON(end > si->max)) + goto out; + + /* + * First free all entries in the range. + */ + for (offset = swp_offset(entry); offset < end; offset++) { + if (!WARN_ON(data_race(!si->swap_map[offset]))) { + count = __swap_entry_free(si, swp_entry(type, offset)); + if (count == SWAP_HAS_CACHE) + any_only_cache = true; } + } + + /* + * Short-circuit the below loop if none of the entries had their + * reference drop to zero. + */ + if (!any_only_cache) + goto out; - count = __swap_entry_free(p, entry); - if (count == SWAP_HAS_CACHE) - __try_to_reclaim_swap(p, swp_offset(entry), + /* + * Now go back over the range trying to reclaim the swap cache. This is + * more efficient for large folios because we will only try to reclaim + * the swap once per folio in the common case. If we do + * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the + * latter will get a reference and lock the folio for every individual + * page but will only succeed once the swap slot for every subpage is + * zero. + */ + for (offset = swp_offset(entry); offset < end; offset += nr) { + nr = 1; + if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { + /* + * Folios are always naturally aligned in swap so + * advance forward to the next boundary. Zero means no + * folio was found for the swap entry, so advance by 1 + * in this case. Negative value means folio was found + * but could not be reclaimed. Here we can still advance + * to the next boundary. 
+ */ + nr = __try_to_reclaim_swap(si, offset, TTRS_UNMAPPED | TTRS_FULL); - put_swap_device(p); + if (nr == 0) + nr = 1; + else if (nr < 0) + nr = -nr; + nr = ALIGN(offset + 1, nr) - offset; + } } - return p != NULL; + +out: + put_swap_device(si); } #ifdef CONFIG_HIBERNATION
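The advance step in the second loop of free_swap_and_cache_nr() is the subtle part, so here is a short userspace walk-through of the arithmetic (same mask-based rounding as the kernel's ALIGN() for a power-of-two size; the offsets are made up for the example). Whatever offset inside a naturally aligned folio the scan has reached, ALIGN(offset + 1, nr) - offset steps it to the first offset of the next folio:

#include <stdio.h>

/* Same rounding semantics as the kernel's ALIGN() for power-of-two 'a'. */
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((unsigned long)(a) - 1))

int main(void)
{
	/*
	 * Suppose __try_to_reclaim_swap() reported a 16-page folio (nr = 16)
	 * while we were scanning at various offsets inside it. Folios are
	 * naturally aligned in swap, so this folio spans offsets 96..111.
	 */
	unsigned long offsets[] = { 96, 101, 111 };

	for (int i = 0; i < 3; i++) {
		unsigned long offset = offsets[i];
		unsigned long step = ALIGN(offset + 1, 16) - offset;

		/* In every case the next offset is 112: the next boundary. */
		printf("offset %3lu: advance by %2lu -> next offset %lu\n",
		       offset, step, offset + step);
	}
	return 0;
}

When __try_to_reclaim_swap() returns 0 (no folio found) the caller advances by one instead, and a negative return is negated first, so the same boundary calculation applies whether or not the reclaim succeeded.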