
[v6,2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()

Message ID 20240403114032.1162100-3-ryan.roberts@arm.com (mailing list archive)
State New
Series Swap-out mTHP without splitting

Commit Message

Ryan Roberts April 3, 2024, 11:40 a.m. UTC
Now that we no longer have a convenient flag in the cluster to determine
if a folio is large, free_swap_and_cache() will take a reference and
lock a large folio much more often, which could lead to contention and
(e.g.) failure to split large folios, etc.

Let's solve that problem by batch freeing swap and cache with a new
function, free_swap_and_cache_nr(), to free a contiguous range of swap
entries together. This allows us to first drop a reference to each swap
slot before we try to release the cache folio. This means we only try to
release the folio once, only taking the reference and lock once - much
better than the previous 512 times for the 2M THP case.

Contiguous swap entries are gathered in zap_pte_range() and
madvise_free_pte_range() in a similar way to how present ptes are
already gathered in zap_pte_range().

While we are at it, let's simplify by converting the return type of both
functions to void. The return value was used only by zap_pte_range() to
print a bad pte, and was ignored by everyone else, so the extra
reporting wasn't exactly guaranteed. We will still get the warning with
most of the information from get_swap_device(). With the batch version,
we wouldn't know which pte was bad anyway so could print the wrong one.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 28 ++++++++++++++
 include/linux/swap.h    | 12 ++++--
 mm/internal.h           | 48 +++++++++++++++++++++++
 mm/madvise.c            | 12 ++++--
 mm/memory.c             | 13 ++++---
 mm/swapfile.c           | 86 ++++++++++++++++++++++++++++++++---------
 6 files changed, 167 insertions(+), 32 deletions(-)

Comments

Ryan Roberts April 8, 2024, 9:22 a.m. UTC | #1
Andrew, looks like there are a few niggles for me to address below, so please
don't let v6 (currently in mm-unstable) progress to mm-stable. I'll aim to get a
new version out (or a patch for this, depending on how discussion lands) in the
next couple of days.


On 05/04/2024 11:13, David Hildenbrand wrote:
> On 03.04.24 13:40, Ryan Roberts wrote:
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h | 28 ++++++++++++++
>>   include/linux/swap.h    | 12 ++++--
>>   mm/internal.h           | 48 +++++++++++++++++++++++
>>   mm/madvise.c            | 12 ++++--
>>   mm/memory.c             | 13 ++++---
>>   mm/swapfile.c           | 86 ++++++++++++++++++++++++++++++++---------
>>   6 files changed, 167 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a3fc8150b047..0278259f7078 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct
>> mm_struct *mm,
>>   }
>>   #endif
>>   +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
> 
> 
> Consecutive only in the page table or also in some other sense?
> 
> I suspect: just unrelated non-present entries of any kind (swp, nonswp) and any
> offset/pfn.

yes.

> 
> Consider documenting that.

How about: "Clear multiple not present PTEs which are consecutive in the pgtable"?


> 
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> +        unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +    for (;;) {
>> +        pte_clear_not_present_full(mm, addr, ptep, full);
>> +        if (--nr == 0)
>> +            break;
>> +        ptep++;
>> +        addr += PAGE_SIZE;
>> +    }
>> +}
>> +#endif
>> +
>>   #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>   extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>                     unsigned long address,
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f6f78198f000..5737236dc3ce 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>>   extern int swapcache_prepare(swp_entry_t);
>>   extern void swap_free(swp_entry_t);
>>   extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> -extern int free_swap_and_cache(swp_entry_t);
>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>   int swap_type_of(dev_t device, sector_t offset);
>>   int find_first_swap(dev_t *device);
>>   extern unsigned int count_swap_pages(int, int);
>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct
>> *si)
>>   #define free_pages_and_swap_cache(pages, nr) \
>>       release_pages((pages), (nr));
>>   
> 
> 
> [...]
> 
>> +
>> +/**
>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @entry: Swap entry recovered from the first table entry.
>> + *
>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>> + * containing swap entries all with consecutive offsets and targeting the same
>> + * swap type.
>> + *
> 
> Likely you should document that any swp pte bits are ignored? ()

Sorry I don't understand this comment. I thought any non-none, non-present PTE
was always considered to contain only a "swap entry" and a swap entry consists
of a "type" and an "offset" only. (and its a special "non-swap" swap entry if
type > SOME_CONSTANT) Are you saying there are additional fields in the PTE that
are not part of the swap entry?


> 
>> + * max_nr must be at least one and must be limited by the caller so scanning
>> + * cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> + */
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>> +                 swp_entry_t entry)
>> +{
>> +    const pte_t *end_ptep = start_ptep + max_nr;
>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>> +    unsigned int expected_type = swp_type(entry);
>> +    pte_t *ptep = start_ptep + 1;
>> +
>> +    VM_WARN_ON(max_nr < 1);
>> +    VM_WARN_ON(non_swap_entry(entry));
>> +
>> +    while (ptep < end_ptep) {
>> +        pte_t pte = ptep_get(ptep);
>> +
>> +        if (pte_none(pte) || pte_present(pte))
>> +            break;
>> +
>> +        entry = pte_to_swp_entry(pte);
>> +
>> +        if (non_swap_entry(entry) ||
>> +            swp_type(entry) != expected_type ||
>> +            swp_offset(entry) != expected_offset)
>> +            break;
>> +
>> +        expected_offset++;
>> +        ptep++;
>> +    }
>> +
>> +    return ptep - start_ptep;
>> +}
> 
> Looks very clean :)
> 
> I was wondering whether we could similarly construct the expected swp PTE and
> only check pte_same.
> 
> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
> 
> ... or have a variant to increase only the swp offset for an existing pte. But
> non-trivial due to the arch-dependent format.
> 
> But then, we'd fail on mismatch of other swp pte bits.

Hmm, perhaps I have a misunderstanding regarding "swp pte bits"...

> 
> 
> On swapin, when reusing this function (likely!), we'll want to make sure that
> the PTE bits match as well.
> 
> See below regarding uffd-wp.
> 
> 
>>   #endif /* CONFIG_MMU */
>>     void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 1f77a51baaac..070bedb4996e 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>       struct folio *folio;
>>       int nr_swap = 0;
>>       unsigned long next;
>> +    int nr, max_nr;
>>         next = pmd_addr_end(addr, end);
>>       if (pmd_trans_huge(*pmd))
>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>           return 0;
>>       flush_tlb_batched_pending(mm);
>>       arch_enter_lazy_mmu_mode();
>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>> +        nr = 1;
>>           ptent = ptep_get(pte);
>>             if (pte_none(ptent))
>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>                 entry = pte_to_swp_entry(ptent);
>>               if (!non_swap_entry(entry)) {
>> -                nr_swap--;
>> -                free_swap_and_cache(entry);
>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +                max_nr = (end - addr) / PAGE_SIZE;
>> +                nr = swap_pte_batch(pte, max_nr, entry);
>> +                nr_swap -= nr;
>> +                free_swap_and_cache_nr(entry, nr);
>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>               } else if (is_hwpoison_entry(entry) ||
>>                      is_poisoned_swp_entry(entry)) {
>>                   pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7dc6c3d9fa83..ef2968894718 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>> *tlb,
>>                   folio_remove_rmap_pte(folio, page, vma);
>>               folio_put(folio);
>>           } else if (!non_swap_entry(entry)) {
>> -            /* Genuine swap entry, hence a private anon page */
>> +            max_nr = (end - addr) / PAGE_SIZE;
>> +            nr = swap_pte_batch(pte, max_nr, entry);
>> +            /* Genuine swap entries, hence a private anon pages */
>>               if (!should_zap_cows(details))
>>                   continue;
>> -            rss[MM_SWAPENTS]--;
>> -            if (unlikely(!free_swap_and_cache(entry)))
>> -                print_bad_pte(vma, addr, ptent, NULL);
>> +            rss[MM_SWAPENTS] -= nr;
>> +            free_swap_and_cache_nr(entry, nr);
>>           } else if (is_migration_entry(entry)) {
>>               folio = pfn_swap_entry_folio(entry);
>>               if (!should_zap_folio(details, folio))
>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>               pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>               WARN_ON_ONCE(1);
>>           }
>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> 
> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
> 
> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
> 
> On mixture, you either lose some or place too many markers.

What path are you concerned about here? I don't get how what you describe can
happen? swap_pte_batch() will only give me a batch of actual swap entries and
actual swap entries don't contain uffd-wp info, IIUC. If the function gets to a
"non-swap" swap entry, it bails. I thought the uffd-wp info was populated based
on the VMA state at swap-in? I think you are telling me that it's persisted
across the swap per-pte?

> 
> A simple workaround would be to disable any such batching if the VMA does have
> uffd-wp enabled.
> 
>> +        zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>       } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>         add_mm_rss_vec(mm, rss);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0d44ee2b4f9c..d059de6896c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>>   /* Reclaim the swap entry if swap is getting full*/
>>   #define TTRS_FULL        0x4
>>   -/* returns 1 if swap entry is freed */
>> +/*
>> + * returns number of pages in the folio that backs the swap entry. If positive,
>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>> + * folio was associated with the swap entry.
>> + */
>>   static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>                    unsigned long offset, unsigned long flags)
>>   {
>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>               ret = folio_free_swap(folio);
>>           folio_unlock(folio);
>>       }
>> +    ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>>       folio_put(folio);
>>       return ret;
>>   }
>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>           swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>           spin_lock(&si->lock);
>>           /* entry was freed successfully, try to use this again */
>> -        if (swap_was_freed)
>> +        if (swap_was_freed > 0)
>>               goto checks;
>>           goto scan; /* check next one */
>>       }
>> @@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio)
>>       return true;
>>   }
>>   -/*
>> - * Free the swap entry like above, but also try to
>> - * free the page cache entry if it is the last user.
>> - */
>> -int free_swap_and_cache(swp_entry_t entry)
> 
> Can we have some documentation what this function expects? How does nr relate to
> entry?
> 
> i.e., offset range defined by [entry.offset, entry.offset + nr).

Yep, will add something along these lines.

> 
>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>   {
>> -    struct swap_info_struct *p;
> 
> It might be easier to get if you do
> 
> const unsigned long start_offset = swp_offset(entry);
> const unsigned long end_offset = start_offset + nr;

OK will do this.

> 
>> +    unsigned long end = swp_offset(entry) + nr;
>> +    unsigned int type = swp_type(entry);
>> +    struct swap_info_struct *si;
>> +    bool any_only_cache = false;
>> +    unsigned long offset;
>>       unsigned char count;
>>         if (non_swap_entry(entry))
>> -        return 1;
>> +        return;
>>   -    p = get_swap_device(entry);
>> -    if (p) {
>> -        if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>> -            put_swap_device(p);
>> -            return 0;
>> +    si = get_swap_device(entry);
>> +    if (!si)
>> +        return;
>> +
>> +    if (WARN_ON(end > si->max))
>> +        goto out;
>> +
>> +    /*
>> +     * First free all entries in the range.
>> +     */
>> +    for (offset = swp_offset(entry); offset < end; offset++) {
>> +        if (!WARN_ON(data_race(!si->swap_map[offset]))) {
> 
> Ouch "!(!)). Confusing.
> 
> I'm sure there is a better way to write that, maybe using more lines
> 
> if (data_race(si->swap_map[offset])) {
>     ...
> } else {
>     WARN_ON_ONCE(1);
> }

OK, I thought it was kinda neat myself. I'll change it.

> 
> 
>> +            count = __swap_entry_free(si, swp_entry(type, offset));
>> +            if (count == SWAP_HAS_CACHE)
>> +                any_only_cache = true;
>>           }
>> +    }
>> +
>> +    /*
>> +     * Short-circuit the below loop if none of the entries had their
>> +     * reference drop to zero.
>> +     */
>> +    if (!any_only_cache)
>> +        goto out;
>>   -        count = __swap_entry_free(p, entry);
>> -        if (count == SWAP_HAS_CACHE)
>> -            __try_to_reclaim_swap(p, swp_offset(entry),
>> +    /*
>> +     * Now go back over the range trying to reclaim the swap cache. This is
>> +     * more efficient for large folios because we will only try to reclaim
>> +     * the swap once per folio in the common case. If we do
>> +     * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>> +     * latter will get a reference and lock the folio for every individual
>> +     * page but will only succeed once the swap slot for every subpage is
>> +     * zero.
>> +     */
>> +    for (offset = swp_offset(entry); offset < end; offset += nr) {
>> +        nr = 1;
>> +        if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
> 
> Here we use READ_ONCE() only, above data_race(). Hmmm.

Yes. I think this is correct.

READ_ONCE() is a "marked access" which KCSAN understands, so it won't complain
about it. So data_race() isn't required when READ_ONCE() (or WRITE_ONCE()) is
used. I believe READ_ONCE() is required here because we don't have a lock and we
want to make sure we read it in a non-tearing manner.

We don't need the READ_ONCE() above since we don't care about the exact value -
only that it's not 0 (because we should be holding a ref). So do a plain access
to give the compiler a bit more freedom. But we need to mark that with
data_race() to stop KCSAN from complaining.
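
(To spell out the pattern: a minimal sketch, not the actual mm/swapfile.c
code, assuming an unlocked read of a shared array like si->swap_map:

	/* Existence check only: a plain access is fine, but annotate it
	 * with data_race() so KCSAN doesn't flag the unlocked read. */
	if (!data_race(si->swap_map[offset]))
		WARN_ON_ONCE(1);

	/* Comparing against an exact value: READ_ONCE() guarantees a
	 * single, non-torn load and is already a marked access for KCSAN. */
	if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE)
		reclaim_swap_cache(si, offset);	/* hypothetical helper */

data_race(), READ_ONCE(), WARN_ON_ONCE() and SWAP_HAS_CACHE are the real
kernel symbols; reclaim_swap_cache() is just a stand-in for this sketch.)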

> 
>> +            /*
>> +             * Folios are always naturally aligned in swap so
>> +             * advance forward to the next boundary. Zero means no
>> +             * folio was found for the swap entry, so advance by 1
>> +             * in this case. Negative value means folio was found
>> +             * but could not be reclaimed. Here we can still advance
>> +             * to the next boundary.
>> +             */
>> +            nr = __try_to_reclaim_swap(si, offset,
>>                             TTRS_UNMAPPED | TTRS_FULL);
>> -        put_swap_device(p);
>> +            if (nr == 0)
>> +                nr = 1;
>> +            else if (nr < 0)
>> +                nr = -nr;
>> +            nr = ALIGN(offset + 1, nr) - offset;
>> +        }
> 
> Apart from that nothing jumped at me.

Thanks for the review!
David Hildenbrand April 8, 2024, 9:43 a.m. UTC | #2
>>> +
>>> +/**
>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>> + * @start_ptep: Page table pointer for the first entry.
>>> + * @max_nr: The maximum number of table entries to consider.
>>> + * @entry: Swap entry recovered from the first table entry.
>>> + *
>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>> + * containing swap entries all with consecutive offsets and targeting the same
>>> + * swap type.
>>> + *
>>
>> Likely you should document that any swp pte bits are ignored? ()
> 
> Sorry I don't understand this comment. I thought any non-none, non-present PTE
> was always considered to contain only a "swap entry" and a swap entry consists
> of a "type" and an "offset" only. (and its a special "non-swap" swap entry if
> type > SOME_CONSTANT) Are you saying there are additional fields in the PTE that
> are not part of the swap entry?


pte_swp_soft_dirty()
pte_swp_clear_exclusive()
pte_swp_uffd_wp()

Are PTE bits used for swp PTE.

There is also dirty/young for migration entries, but that's not of a 
concern here, because we stop for non_swap_entry().
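
To make that concrete (sketch for this discussion only, not code from the
patch): two swp PTEs can decode to the exact same swp_entry_t and still
differ in the bits above, which is precisely what a swp_type()/swp_offset()
comparison cannot notice but pte_same() would:

	static inline bool swp_pte_bits_differ(pte_t a, pte_t b)
	{
		/* Same arch-independent swap entry (type + offset) ... */
		if (pte_to_swp_entry(a).val != pte_to_swp_entry(b).val)
			return false;
		/* ... but possibly different per-PTE swp bits. */
		return !!pte_swp_soft_dirty(a) != !!pte_swp_soft_dirty(b) ||
		       !!pte_swp_exclusive(a) != !!pte_swp_exclusive(b) ||
		       !!pte_swp_uffd_wp(a) != !!pte_swp_uffd_wp(b);
	}

(swp_pte_bits_differ() is just a name made up for the illustration.)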

> 
> 
>>
>>> + * max_nr must be at least one and must be limited by the caller so scanning
>>> + * cannot exceed a single page table.
>>> + *
>>> + * Return: the number of table entries in the batch.
>>> + */
>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>>> +                 swp_entry_t entry)
>>> +{
>>> +    const pte_t *end_ptep = start_ptep + max_nr;
>>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>>> +    unsigned int expected_type = swp_type(entry);
>>> +    pte_t *ptep = start_ptep + 1;
>>> +
>>> +    VM_WARN_ON(max_nr < 1);
>>> +    VM_WARN_ON(non_swap_entry(entry));
>>> +
>>> +    while (ptep < end_ptep) {
>>> +        pte_t pte = ptep_get(ptep);
>>> +
>>> +        if (pte_none(pte) || pte_present(pte))
>>> +            break;
>>> +
>>> +        entry = pte_to_swp_entry(pte);
>>> +
>>> +        if (non_swap_entry(entry) ||
>>> +            swp_type(entry) != expected_type ||
>>> +            swp_offset(entry) != expected_offset)
>>> +            break;
>>> +
>>> +        expected_offset++;
>>> +        ptep++;
>>> +    }
>>> +
>>> +    return ptep - start_ptep;
>>> +}
>>
>> Looks very clean :)
>>
>> I was wondering whether we could similarly construct the expected swp PTE and
>> only check pte_same.
>>
>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>>
>> ... or have a variant to increase only the swp offset for an existing pte. But
>> non-trivial due to the arch-dependent format.
>>
>> But then, we'd fail on mismatch of other swp pte bits.
> 
> Hmm, perhaps I have a misunderstanding regarding "swp pte bits"...
> 
>>
>>
>> On swapin, when reusing this function (likely!), we'll want to make sure that
>> the PTE bits match as well.
>>
>> See below regarding uffd-wp.
>>
>>
>>>    #endif /* CONFIG_MMU */
>>>      void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>> index 1f77a51baaac..070bedb4996e 100644
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>> long addr,
>>>        struct folio *folio;
>>>        int nr_swap = 0;
>>>        unsigned long next;
>>> +    int nr, max_nr;
>>>          next = pmd_addr_end(addr, end);
>>>        if (pmd_trans_huge(*pmd))
>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>> long addr,
>>>            return 0;
>>>        flush_tlb_batched_pending(mm);
>>>        arch_enter_lazy_mmu_mode();
>>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>>> +        nr = 1;
>>>            ptent = ptep_get(pte);
>>>              if (pte_none(ptent))
>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>> long addr,
>>>                  entry = pte_to_swp_entry(ptent);
>>>                if (!non_swap_entry(entry)) {
>>> -                nr_swap--;
>>> -                free_swap_and_cache(entry);
>>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> +                max_nr = (end - addr) / PAGE_SIZE;
>>> +                nr = swap_pte_batch(pte, max_nr, entry);
>>> +                nr_swap -= nr;
>>> +                free_swap_and_cache_nr(entry, nr);
>>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>                } else if (is_hwpoison_entry(entry) ||
>>>                       is_poisoned_swp_entry(entry)) {
>>>                    pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 7dc6c3d9fa83..ef2968894718 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>>> *tlb,
>>>                    folio_remove_rmap_pte(folio, page, vma);
>>>                folio_put(folio);
>>>            } else if (!non_swap_entry(entry)) {
>>> -            /* Genuine swap entry, hence a private anon page */
>>> +            max_nr = (end - addr) / PAGE_SIZE;
>>> +            nr = swap_pte_batch(pte, max_nr, entry);
>>> +            /* Genuine swap entries, hence a private anon pages */
>>>                if (!should_zap_cows(details))
>>>                    continue;
>>> -            rss[MM_SWAPENTS]--;
>>> -            if (unlikely(!free_swap_and_cache(entry)))
>>> -                print_bad_pte(vma, addr, ptent, NULL);
>>> +            rss[MM_SWAPENTS] -= nr;
>>> +            free_swap_and_cache_nr(entry, nr);
>>>            } else if (is_migration_entry(entry)) {
>>>                folio = pfn_swap_entry_folio(entry);
>>>                if (!should_zap_folio(details, folio))
>>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>                pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>>                WARN_ON_ONCE(1);
>>>            }
>>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>
>> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>>
>> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
>> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>>
>> On mixture, you either lose some or place too many markers.
> 
> What path are you concerned about here? I don't get how what you describe can
> happen? swap_pte_batch() will only give me a batch of actual swap entries and
> actual swap entries don't contain uffd-wp info, IIUC. If the function gets to a
> "non-swap" swap entry, it bails. I thought the uffd-wp info was populated based
> on the VMA state at swap-in? I think you are telling me that it's persisted
> across the swap per-pte?

Please see zap_install_uffd_wp_if_needed():

if (unlikely(pte_swp_uffd_wp_any(pteval)))
	arm_uffd_pte = true;

The PTEs (swp PTEs to be precise) contain uffd-wp information.

[...]

>>> +    /*
>>> +     * Short-circuit the below loop if none of the entries had their
>>> +     * reference drop to zero.
>>> +     */
>>> +    if (!any_only_cache)
>>> +        goto out;
>>>    -        count = __swap_entry_free(p, entry);
>>> -        if (count == SWAP_HAS_CACHE)
>>> -            __try_to_reclaim_swap(p, swp_offset(entry),
>>> +    /*
>>> +     * Now go back over the range trying to reclaim the swap cache. This is
>>> +     * more efficient for large folios because we will only try to reclaim
>>> +     * the swap once per folio in the common case. If we do
>>> +     * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>>> +     * latter will get a reference and lock the folio for every individual
>>> +     * page but will only succeed once the swap slot for every subpage is
>>> +     * zero.
>>> +     */
>>> +    for (offset = swp_offset(entry); offset < end; offset += nr) {
>>> +        nr = 1;
>>> +        if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>>
>> Here we use READ_ONCE() only, above data_race(). Hmmm.
> 
> Yes. I think this is correct.
> 
> READ_ONCE() is a "marked access" which KCSAN understands, so it won't complain
> about it. So data_race() isn't required when READ_ONCE() (or WRITE_ONCE()) is
> used. I believe READ_ONCE() is required here because we don't have a lock and we
> want to make sure we read it in a non-tearing manner.
> 
> We don't need the READ_ONCE() above since we don't care about the exact value -
> only that it's not 0 (because we should be holding a ref). So do a plain access
> to give the compiler a bit more freedom. But we need to mark that with
> data_race() to stop KCSAN from complaining.

Okay.
Ryan Roberts April 8, 2024, 10:07 a.m. UTC | #3
On 08/04/2024 10:43, David Hildenbrand wrote:
> 
>>>> +
>>>> +/**
>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>>> + * @start_ptep: Page table pointer for the first entry.
>>>> + * @max_nr: The maximum number of table entries to consider.
>>>> + * @entry: Swap entry recovered from the first table entry.
>>>> + *
>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>>> + * containing swap entries all with consecutive offsets and targeting the same
>>>> + * swap type.
>>>> + *
>>>
>>> Likely you should document that any swp pte bits are ignored? ()
>>
>> Sorry I don't understand this comment. I thought any non-none, non-present PTE
>> was always considered to contain only a "swap entry" and a swap entry consists
>> of a "type" and an "offset" only. (and its a special "non-swap" swap entry if
>> type > SOME_CONSTANT) Are you saying there are additional fields in the PTE that
>> are not part of the swap entry?
> 
> 
> pte_swp_soft_dirty()
> pte_swp_clear_exclusive()
> pte_swp_uffd_wp()
> 
> Are PTE bits used for swp PTE.

Ahh wow. mind blown. Looks like a massive hack... why not store them in the
arch-independent swap entry, rather than have them squat independently in the PTE?

OK, my implementation is buggy. I'll re-spin to fix this.


> 
> There is also dirty/young for migration entries, but that's not of a concern
> here, because we stop for non_swap_entry().

Looks like these are part of the offset field in the arch-independent swap entry
- much cleaner ;-).

> 
>>
>>
>>>
>>>> + * max_nr must be at least one and must be limited by the caller so scanning
>>>> + * cannot exceed a single page table.
>>>> + *
>>>> + * Return: the number of table entries in the batch.
>>>> + */
>>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>>>> +                 swp_entry_t entry)
>>>> +{
>>>> +    const pte_t *end_ptep = start_ptep + max_nr;
>>>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>>>> +    unsigned int expected_type = swp_type(entry);
>>>> +    pte_t *ptep = start_ptep + 1;
>>>> +
>>>> +    VM_WARN_ON(max_nr < 1);
>>>> +    VM_WARN_ON(non_swap_entry(entry));
>>>> +
>>>> +    while (ptep < end_ptep) {
>>>> +        pte_t pte = ptep_get(ptep);
>>>> +
>>>> +        if (pte_none(pte) || pte_present(pte))
>>>> +            break;
>>>> +
>>>> +        entry = pte_to_swp_entry(pte);
>>>> +
>>>> +        if (non_swap_entry(entry) ||
>>>> +            swp_type(entry) != expected_type ||
>>>> +            swp_offset(entry) != expected_offset)
>>>> +            break;
>>>> +
>>>> +        expected_offset++;
>>>> +        ptep++;
>>>> +    }
>>>> +
>>>> +    return ptep - start_ptep;
>>>> +}
>>>
>>> Looks very clean :)
>>>
>>> I was wondering whether we could similarly construct the expected swp PTE and
>>> only check pte_same.
>>>
>>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>>>
>>> ... or have a variant to increase only the swp offset for an existing pte. But
>>> non-trivial due to the arch-dependent format.
>>>
>>> But then, we'd fail on mismatch of other swp pte bits.
>>
>> Hmm, perhaps I have a misunderstanding regarding "swp pte bits"...
>>
>>>
>>>
>>> On swapin, when reusing this function (likely!), we'll want to make sure that
>>> the PTE bits match as well.
>>>
>>> See below regarding uffd-wp.
>>>
>>>
>>>>    #endif /* CONFIG_MMU */
>>>>      void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 1f77a51baaac..070bedb4996e 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>        struct folio *folio;
>>>>        int nr_swap = 0;
>>>>        unsigned long next;
>>>> +    int nr, max_nr;
>>>>          next = pmd_addr_end(addr, end);
>>>>        if (pmd_trans_huge(*pmd))
>>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>            return 0;
>>>>        flush_tlb_batched_pending(mm);
>>>>        arch_enter_lazy_mmu_mode();
>>>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>>>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>>>> +        nr = 1;
>>>>            ptent = ptep_get(pte);
>>>>              if (pte_none(ptent))
>>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>                  entry = pte_to_swp_entry(ptent);
>>>>                if (!non_swap_entry(entry)) {
>>>> -                nr_swap--;
>>>> -                free_swap_and_cache(entry);
>>>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> +                max_nr = (end - addr) / PAGE_SIZE;
>>>> +                nr = swap_pte_batch(pte, max_nr, entry);
>>>> +                nr_swap -= nr;
>>>> +                free_swap_and_cache_nr(entry, nr);
>>>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>>                } else if (is_hwpoison_entry(entry) ||
>>>>                       is_poisoned_swp_entry(entry)) {
>>>>                    pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 7dc6c3d9fa83..ef2968894718 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>>>> *tlb,
>>>>                    folio_remove_rmap_pte(folio, page, vma);
>>>>                folio_put(folio);
>>>>            } else if (!non_swap_entry(entry)) {
>>>> -            /* Genuine swap entry, hence a private anon page */
>>>> +            max_nr = (end - addr) / PAGE_SIZE;
>>>> +            nr = swap_pte_batch(pte, max_nr, entry);
>>>> +            /* Genuine swap entries, hence a private anon pages */
>>>>                if (!should_zap_cows(details))
>>>>                    continue;
>>>> -            rss[MM_SWAPENTS]--;
>>>> -            if (unlikely(!free_swap_and_cache(entry)))
>>>> -                print_bad_pte(vma, addr, ptent, NULL);
>>>> +            rss[MM_SWAPENTS] -= nr;
>>>> +            free_swap_and_cache_nr(entry, nr);
>>>>            } else if (is_migration_entry(entry)) {
>>>>                folio = pfn_swap_entry_folio(entry);
>>>>                if (!should_zap_folio(details, folio))
>>>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather
>>>> *tlb,
>>>>                pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>>>                WARN_ON_ONCE(1);
>>>>            }
>>>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>>>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>
>>> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>>>
>>> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
>>> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>>>
>>> On mixture, you either lose some or place too many markers.
>>
>> What path are you concerned about here? I don't get how what you describe can
>> happen? swap_pte_batch() will only give me a batch of actual swap entries and
>> actual swap entries don't contain uffd-wp info, IIUC. If the function gets to a
>> "non-swap" swap entry, it bails. I thought the uffd-wp info was populated based
>> on the VMA state at swap-in? I think you are telling me that it's persisted
>> across the swap per-pte?
> 
> Please see zap_install_uffd_wp_if_needed():
> 
> if (unlikely(pte_swp_uffd_wp_any(pteval)))
>     arm_uffd_pte = true;
> 
> The PTEs (swp PTEs to be precise) contain uffd-wp information.
> 
> [...]
> 
>>>> +    /*
>>>> +     * Short-circuit the below loop if none of the entries had their
>>>> +     * reference drop to zero.
>>>> +     */
>>>> +    if (!any_only_cache)
>>>> +        goto out;
>>>>    -        count = __swap_entry_free(p, entry);
>>>> -        if (count == SWAP_HAS_CACHE)
>>>> -            __try_to_reclaim_swap(p, swp_offset(entry),
>>>> +    /*
>>>> +     * Now go back over the range trying to reclaim the swap cache. This is
>>>> +     * more efficient for large folios because we will only try to reclaim
>>>> +     * the swap once per folio in the common case. If we do
>>>> +     * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>>>> +     * latter will get a reference and lock the folio for every individual
>>>> +     * page but will only succeed once the swap slot for every subpage is
>>>> +     * zero.
>>>> +     */
>>>> +    for (offset = swp_offset(entry); offset < end; offset += nr) {
>>>> +        nr = 1;
>>>> +        if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>>>
>>> Here we use READ_ONCE() only, above data_race(). Hmmm.
>>
>> Yes. I think this is correct.
>>
>> READ_ONCE() is a "marked access" which KCSAN understands, so it won't complain
>> about it. So data_race() isn't required when READ_ONCE() (or WRITE_ONCE()) is
>> used. I believe READ_ONCE() is required here because we don't have a lock and we
>> want to make sure we read it in a non-tearing manner.
>>
>> We don't need the READ_ONCE() above since we don't care about the exact value -
>> only that it's not 0 (because we should be holding a ref). So do a plain access
>> to give the compiler a bit more freedom. But we need to mark that with
>> data_race() to stop KCSAN from complaining.
> 
> Okay.
>
Ryan Roberts April 8, 2024, 10:39 a.m. UTC | #4
On 08/04/2024 11:24, David Hildenbrand wrote:
> On 08.04.24 12:07, Ryan Roberts wrote:
>> On 08/04/2024 10:43, David Hildenbrand wrote:
>>>
>>>>>> +
>>>>>> +/**
>>>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>>>>> + * @start_ptep: Page table pointer for the first entry.
>>>>>> + * @max_nr: The maximum number of table entries to consider.
>>>>>> + * @entry: Swap entry recovered from the first table entry.
>>>>>> + *
>>>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>>>>> + * containing swap entries all with consecutive offsets and targeting the
>>>>>> same
>>>>>> + * swap type.
>>>>>> + *
>>>>>
>>>>> Likely you should document that any swp pte bits are ignored? ()
>>>>
>>>> Sorry I don't understand this comment. I thought any non-none, non-present PTE
>>>> was always considered to contain only a "swap entry" and a swap entry consists
>>>> of a "type" and an "offset" only. (and its a special "non-swap" swap entry if
>>>> type > SOME_CONSTANT) Are you saying there are additional fields in the PTE
>>>> that
>>>> are not part of the swap entry?
>>>
>>>
>>> pte_swp_soft_dirty()
>>> pte_swp_clear_exclusive()
>>> pte_swp_uffd_wp()
>>>
>>> Are PTE bits used for swp PTE.
>>
>> Ahh wow. mind blown. Looks like a massive hack... why not store them in the
>> arch-independent swap entry, rather than have them squat independently in the
>> PTE?
> 
> I think that was discussed at some point, but it not only requires quite some
> churn to change it (all that swp entry code is a mess), these bits are
> conceptually really per PTE and not something you would want to pass into actual
> swap handling code that couldn't care less about all of these.
> 
> I looked at this when I added SWP exclusive, but accidentally losing the SWP
> exclusive marker when converting back and forth made me go the PTE route instead.
> 
> Then, the available PTE bits are a bit scattered on some architectures, and
> converting entry<->PTE gets even uglier if we don't want to "lose" available bits.
> 
> Probably the whole "unsigned long swp_entry" stuff should be replaced by a
> proper struct where we could more easily add flags and have the arch code handle
> the conversion to the PTE just once. The arch-specific swp_entry stuff is
> another nasty thing IMHO.

Yep understood. I'll file this under "there be dragons". Thanks for the explanation.

> 
>>
>> OK, my implementation is buggy. I'll re-spin to fix this.
>>
>>
>>>
>>> There is also dirty/young for migration entries, but that's not of a concern
>>> here, because we stop for non_swap_entry().
>>
>> Looks like these are part of the offset field in the arch-independent swap entry
>> - much cleaner ;-).
> 
> Note that it only applies to migration entries, and only when we have some spare
> bits due to PFN < offset.

Yep got it. Thanks!
Ryan Roberts April 8, 2024, 12:07 p.m. UTC | #5
[...]
> 
> [...]
> 
>> +
>> +/**
>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @entry: Swap entry recovered from the first table entry.
>> + *
>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>> + * containing swap entries all with consecutive offsets and targeting the same
>> + * swap type.
>> + *
> 
> Likely you should document that any swp pte bits are ignored? ()

Now that I understand what swp pte bits are, I think the simplest thing is to
just make this function always consider the pte bits by using pte_same() as you
suggest below? I don't think there is ever a case for ignoring the swp pte bits?
And then I don't need to do anything special for uffd-wp either (below you
suggested not doing batching when the VMA has uffd enabled).

Any concerns?

> 
>> + * max_nr must be at least one and must be limited by the caller so scanning
>> + * cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> + */
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>> +                 swp_entry_t entry)
>> +{
>> +    const pte_t *end_ptep = start_ptep + max_nr;
>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>> +    unsigned int expected_type = swp_type(entry);
>> +    pte_t *ptep = start_ptep + 1;
>> +
>> +    VM_WARN_ON(max_nr < 1);
>> +    VM_WARN_ON(non_swap_entry(entry));
>> +
>> +    while (ptep < end_ptep) {
>> +        pte_t pte = ptep_get(ptep);
>> +
>> +        if (pte_none(pte) || pte_present(pte))
>> +            break;
>> +
>> +        entry = pte_to_swp_entry(pte);
>> +
>> +        if (non_swap_entry(entry) ||
>> +            swp_type(entry) != expected_type ||
>> +            swp_offset(entry) != expected_offset)
>> +            break;
>> +
>> +        expected_offset++;
>> +        ptep++;
>> +    }
>> +
>> +    return ptep - start_ptep;
>> +}
> 
> Looks very clean :)
> 
> I was wondering whether we could similarly construct the expected swp PTE and
> only check pte_same.
> 
> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));

So planning to do this.

> 
> ... or have a variant to increase only the swp offset for an existing pte. But
> non-trivial due to the arch-dependent format.

not this - I agree this will be difficult due to per-arch changes. I'd rather
just do the generic version and leave the compiler to do the best it can to
simplify and optimize.

> 
> But then, we'd fail on mismatch of other swp pte bits.
> 
> 
> On swapin, when reusing this function (likely!), we'll want to make sure that
> the PTE bits match as well.
> 
> See below regarding uffd-wp.
> 
> 
>>   #endif /* CONFIG_MMU */
>>     void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 1f77a51baaac..070bedb4996e 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>       struct folio *folio;
>>       int nr_swap = 0;
>>       unsigned long next;
>> +    int nr, max_nr;
>>         next = pmd_addr_end(addr, end);
>>       if (pmd_trans_huge(*pmd))
>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>           return 0;
>>       flush_tlb_batched_pending(mm);
>>       arch_enter_lazy_mmu_mode();
>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>> +        nr = 1;
>>           ptent = ptep_get(pte);
>>             if (pte_none(ptent))
>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>                 entry = pte_to_swp_entry(ptent);
>>               if (!non_swap_entry(entry)) {
>> -                nr_swap--;
>> -                free_swap_and_cache(entry);
>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +                max_nr = (end - addr) / PAGE_SIZE;
>> +                nr = swap_pte_batch(pte, max_nr, entry);
>> +                nr_swap -= nr;
>> +                free_swap_and_cache_nr(entry, nr);
>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>               } else if (is_hwpoison_entry(entry) ||
>>                      is_poisoned_swp_entry(entry)) {
>>                   pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7dc6c3d9fa83..ef2968894718 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>> *tlb,
>>                   folio_remove_rmap_pte(folio, page, vma);
>>               folio_put(folio);
>>           } else if (!non_swap_entry(entry)) {
>> -            /* Genuine swap entry, hence a private anon page */
>> +            max_nr = (end - addr) / PAGE_SIZE;
>> +            nr = swap_pte_batch(pte, max_nr, entry);
>> +            /* Genuine swap entries, hence a private anon pages */
>>               if (!should_zap_cows(details))
>>                   continue;
>> -            rss[MM_SWAPENTS]--;
>> -            if (unlikely(!free_swap_and_cache(entry)))
>> -                print_bad_pte(vma, addr, ptent, NULL);
>> +            rss[MM_SWAPENTS] -= nr;
>> +            free_swap_and_cache_nr(entry, nr);
>>           } else if (is_migration_entry(entry)) {
>>               folio = pfn_swap_entry_folio(entry);
>>               if (!should_zap_folio(details, folio))
>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>               pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>               WARN_ON_ONCE(1);
>>           }
>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> 
> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
> 
> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
> 
> On mixture, you either lose some or place too many markers.
> 
> A simple workaround would be to disable any such batching if the VMA does have
> uffd-wp enabled.

Rather than this, I'll just consider all the swp pte bits when batching.

> 
>> +        zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>       } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);

[...]
Ryan Roberts April 8, 2024, 12:47 p.m. UTC | #6
On 08/04/2024 13:07, Ryan Roberts wrote:
> [...]
>>
>> [...]
>>
>>> +
>>> +/**
>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>> + * @start_ptep: Page table pointer for the first entry.
>>> + * @max_nr: The maximum number of table entries to consider.
>>> + * @entry: Swap entry recovered from the first table entry.
>>> + *
>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>> + * containing swap entries all with consecutive offsets and targeting the same
>>> + * swap type.
>>> + *
>>
>> Likely you should document that any swp pte bits are ignored? ()
> 
> Now that I understand what swp pte bits are, I think the simplest thing is to
> just make this function always consider the pte bits by using pte_same() as you
> suggest below? I don't think there is ever a case for ignoring the swp pte bits?
> And then I don't need to do anything special for uffd-wp either (below you
> suggested not doing batching when the VMA has uffd enabled).
> 
> Any concerns?
> 
>>
>>> + * max_nr must be at least one and must be limited by the caller so scanning
>>> + * cannot exceed a single page table.
>>> + *
>>> + * Return: the number of table entries in the batch.
>>> + */
>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>>> +                 swp_entry_t entry)
>>> +{
>>> +    const pte_t *end_ptep = start_ptep + max_nr;
>>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>>> +    unsigned int expected_type = swp_type(entry);
>>> +    pte_t *ptep = start_ptep + 1;
>>> +
>>> +    VM_WARN_ON(max_nr < 1);
>>> +    VM_WARN_ON(non_swap_entry(entry));
>>> +
>>> +    while (ptep < end_ptep) {
>>> +        pte_t pte = ptep_get(ptep);
>>> +
>>> +        if (pte_none(pte) || pte_present(pte))
>>> +            break;
>>> +
>>> +        entry = pte_to_swp_entry(pte);
>>> +
>>> +        if (non_swap_entry(entry) ||
>>> +            swp_type(entry) != expected_type ||
>>> +            swp_offset(entry) != expected_offset)
>>> +            break;
>>> +
>>> +        expected_offset++;
>>> +        ptep++;
>>> +    }
>>> +
>>> +    return ptep - start_ptep;
>>> +}
>>
>> Looks very clean :)
>>
>> I was wondering whether we could similarly construct the expected swp PTE and
>> only check pte_same.
>>
>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
> 
> So planning to do this.

Of course this clears all the swp pte bits in expected_pte, so we need to do
something a bit more complex.

If we can safely assume the offset bits are contiguous in every per-arch
representation, then we can do:

static inline pte_t pte_next_swp_offset(pte_t pte)
{
	pte_t offset_inc = __swp_entry_to_pte(__swp_entry(0, 1));

	return __pte(pte_val(pte) + pte_val(offset_inc));
}

Or if not:

static inline pte_t pte_next_swp_offset(pte_t pte)
{
	swp_entry_t entry = pte_to_swp_entry(pte);
	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry), swp_offset(entry) + 1));

	if (pte_swp_soft_dirty(pte))
		new = pte_swp_mksoft_dirty(new);
	if (pte_swp_exclusive(pte))
		new = pte_swp_mkexclusive(new);
	if (pte_swp_uffd_wp(pte))
		new = pte_swp_mkuffd_wp(new);

	return new;
}

Then swap_pte_batch() becomes:

static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
{
	pte_t expected_pte = pte_next_swp_offset(pte);
	const pte_t *end_ptep = start_ptep + max_nr;
	pte_t *ptep = start_ptep + 1;

	VM_WARN_ON(max_nr < 1);
	VM_WARN_ON(!is_swap_pte(pte));
	VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));

	while (ptep < end_ptep) {
		pte = ptep_get(ptep);

		if (!pte_same(pte, expected_pte))
			break;

		expected_pte = pte_next_swp_offset(expected_pte);
		ptep++;
	}

	return ptep - start_ptep;
}

Would you be happy with either of these? I'll go look if we can assume the offset bits are always contiguous.
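
For clarity, callers would then pass the pte they have already read rather
than the decoded entry. E.g. the madvise_free_pte_range() hunk above would
become something like this (sketch only, assuming the new signature):

	max_nr = (end - addr) / PAGE_SIZE;
	nr = swap_pte_batch(pte, max_nr, ptent);
	nr_swap -= nr;
	free_swap_and_cache_nr(entry, nr);
	clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);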


> 
>>
>> ... or have a variant to increase only the swp offset for an existing pte. But
>> non-trivial due to the arch-dependent format.
> 
> not this - I agree this will be difficult due to per-arch changes. I'd rather
> just do the generic version and leave the compiler to do the best it can to
> simplify and optimize.
> 
>>
>> But then, we'd fail on mismatch of other swp pte bits.
>>
>>
>> On swapin, when reusing this function (likely!), we'll want to make sure that
>> the PTE bits match as well.
>>
>> See below regarding uffd-wp.
>>
>>
>>>   #endif /* CONFIG_MMU */
>>>     void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>> index 1f77a51baaac..070bedb4996e 100644
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>> long addr,
>>>       struct folio *folio;
>>>       int nr_swap = 0;
>>>       unsigned long next;
>>> +    int nr, max_nr;
>>>         next = pmd_addr_end(addr, end);
>>>       if (pmd_trans_huge(*pmd))
>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>> long addr,
>>>           return 0;
>>>       flush_tlb_batched_pending(mm);
>>>       arch_enter_lazy_mmu_mode();
>>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>>> +        nr = 1;
>>>           ptent = ptep_get(pte);
>>>             if (pte_none(ptent))
>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>> long addr,
>>>                 entry = pte_to_swp_entry(ptent);
>>>               if (!non_swap_entry(entry)) {
>>> -                nr_swap--;
>>> -                free_swap_and_cache(entry);
>>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> +                max_nr = (end - addr) / PAGE_SIZE;
>>> +                nr = swap_pte_batch(pte, max_nr, entry);
>>> +                nr_swap -= nr;
>>> +                free_swap_and_cache_nr(entry, nr);
>>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>               } else if (is_hwpoison_entry(entry) ||
>>>                      is_poisoned_swp_entry(entry)) {
>>>                   pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 7dc6c3d9fa83..ef2968894718 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>>> *tlb,
>>>                   folio_remove_rmap_pte(folio, page, vma);
>>>               folio_put(folio);
>>>           } else if (!non_swap_entry(entry)) {
>>> -            /* Genuine swap entry, hence a private anon page */
>>> +            max_nr = (end - addr) / PAGE_SIZE;
>>> +            nr = swap_pte_batch(pte, max_nr, entry);
>>> +            /* Genuine swap entries, hence a private anon pages */
>>>               if (!should_zap_cows(details))
>>>                   continue;
>>> -            rss[MM_SWAPENTS]--;
>>> -            if (unlikely(!free_swap_and_cache(entry)))
>>> -                print_bad_pte(vma, addr, ptent, NULL);
>>> +            rss[MM_SWAPENTS] -= nr;
>>> +            free_swap_and_cache_nr(entry, nr);
>>>           } else if (is_migration_entry(entry)) {
>>>               folio = pfn_swap_entry_folio(entry);
>>>               if (!should_zap_folio(details, folio))
>>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>               pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>>               WARN_ON_ONCE(1);
>>>           }
>>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>
>> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>>
>> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
>> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>>
>> On mixture, you either lose some or place too many markers.
>>
>> A simple workaround would be to disable any such batching if the VMA does have
>> uffd-wp enabled.
> 
> Rather than this, I'll just consider all the swp pte bits when batching.
> 
>>
>>> +        zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>>       } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
> 
> [...]
>
Ryan Roberts April 8, 2024, 1:27 p.m. UTC | #7
On 08/04/2024 13:47, Ryan Roberts wrote:
> On 08/04/2024 13:07, Ryan Roberts wrote:
>> [...]
>>>
>>> [...]
>>>
>>>> +
>>>> +/**
>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>>> + * @start_ptep: Page table pointer for the first entry.
>>>> + * @max_nr: The maximum number of table entries to consider.
>>>> + * @entry: Swap entry recovered from the first table entry.
>>>> + *
>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>>> + * containing swap entries all with consecutive offsets and targeting the same
>>>> + * swap type.
>>>> + *
>>>
>>> Likely you should document that any swp pte bits are ignored? ()
>>
>> Now that I understand what swp pte bits are, I think the simplest thing is to
>> just make this function always consider the pte bits by using pte_same() as you
>> suggest below? I don't think there is ever a case for ignoring the swp pte bits?
>> And then I don't need to do anything special for uffd-wp either (below you
>> suggested not doing batching when the VMA has uffd enabled).
>>
>> Any concerns?
>>
>>>
>>>> + * max_nr must be at least one and must be limited by the caller so scanning
>>>> + * cannot exceed a single page table.
>>>> + *
>>>> + * Return: the number of table entries in the batch.
>>>> + */
>>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>>>> +                 swp_entry_t entry)
>>>> +{
>>>> +    const pte_t *end_ptep = start_ptep + max_nr;
>>>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>>>> +    unsigned int expected_type = swp_type(entry);
>>>> +    pte_t *ptep = start_ptep + 1;
>>>> +
>>>> +    VM_WARN_ON(max_nr < 1);
>>>> +    VM_WARN_ON(non_swap_entry(entry));
>>>> +
>>>> +    while (ptep < end_ptep) {
>>>> +        pte_t pte = ptep_get(ptep);
>>>> +
>>>> +        if (pte_none(pte) || pte_present(pte))
>>>> +            break;
>>>> +
>>>> +        entry = pte_to_swp_entry(pte);
>>>> +
>>>> +        if (non_swap_entry(entry) ||
>>>> +            swp_type(entry) != expected_type ||
>>>> +            swp_offset(entry) != expected_offset)
>>>> +            break;
>>>> +
>>>> +        expected_offset++;
>>>> +        ptep++;
>>>> +    }
>>>> +
>>>> +    return ptep - start_ptep;
>>>> +}
>>>
>>> Looks very clean :)
>>>
>>> I was wondering whether we could similarly construct the expected swp PTE and
>>> only check pte_same.
>>>
>>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>>
>> So planning to do this.
> 
> Of course this clears all the swp pte bits in expected_pte. So we need to do something a bit more complex.
> 
> If we can safely assume all offset bits are contiguous in every per-arch representation then we can do:

Looks like at least csky and hexagon store the offset in discontiguous regions.
So it will have to be the second approach if we want to avoid anything
arch-specific. I'll assume that for now; we can always specialize
pte_next_swp_offset() per-arch in the future if needed.
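
For reference, a minimal sketch of what such a per-arch specialization could look like (purely illustrative, not part of this series), mirroring the addition-based variant quoted below and assuming the generic helper were guarded with #ifndef in the same way this patch guards clear_not_present_full_ptes(). It is only valid for an architecture that keeps all swap offset bits contiguous in the pte:

#define pte_next_swp_offset pte_next_swp_offset
static inline pte_t pte_next_swp_offset(pte_t pte)
{
	/*
	 * On this (hypothetical) arch the swap offset occupies contiguous
	 * pte bits, so adding the encoding of __swp_entry(0, 1) bumps the
	 * offset field by one while leaving the swp pte bits untouched.
	 */
	return __pte(pte_val(pte) + pte_val(__swp_entry_to_pte(__swp_entry(0, 1))));
}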

> 
> static inline pte_t pte_next_swp_offset(pte_t pte)
> {
> 	pte_t offset_inc = __swp_entry_to_pte(__swp_entry(0, 1));
> 
> 	return __pte(pte_val(pte) + pte_val(offset_inc));
> }
> 
> Or if not:
> 
> static inline pte_t pte_next_swp_offset(pte_t pte)
> {
> 	swp_entry_t entry = pte_to_swp_entry(pte);
> 	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry), swp_offset(entry) + 1));
> 
> 	if (pte_swp_soft_dirty(pte))
> 		new = pte_swp_mksoft_dirty(new);
> 	if (pte_swp_exclusive(pte))
> 		new = pte_swp_mkexclusive(new);
> 	if (pte_swp_uffd_wp(pte))
> 		new = pte_swp_mkuffd_wp(new);
> 
> 	return new;
> }
> 
> Then swap_pte_batch() becomes:
> 
> static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> {
> 	pte_t expected_pte = pte_next_swp_offset(pte);
> 	const pte_t *end_ptep = start_ptep + max_nr;
> 	pte_t *ptep = start_ptep + 1;
> 
> 	VM_WARN_ON(max_nr < 1);
> 	VM_WARN_ON(!is_swap_pte(pte));
> 	VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> 
> 	while (ptep < end_ptep) {
> 		pte = ptep_get(ptep);
> 
> 		if (!pte_same(pte, expected_pte))
> 			break;
> 
> 		expected_pte = pte_next_swp_offset(expected_pte);
> 		ptep++;
> 	}
> 
> 	return ptep - start_ptep;
> }
> 
> Would you be happy with either of these? I'll go and check whether we can assume the offset bits are always contiguous.
> 
> 
>>
>>>
>>> ... or have a variant to increase only the swp offset for an existing pte. But
>>> non-trivial due to the arch-dependent format.
>>
>> not this - I agree this will be difficult due to per-arch changes. I'd rather
>> just do the generic version and leave the compiler to do the best it can to
>> simplify and optimize.
>>
>>>
>>> But then, we'd fail on mismatch of other swp pte bits.
>>>
>>>
>>> On swapin, when reusing this function (likely!), we might want to make sure that
>>> the PTE bits match as well.
>>>
>>> See below regarding uffd-wp.
>>>
>>>
>>>>   #endif /* CONFIG_MMU */
>>>>     void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 1f77a51baaac..070bedb4996e 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>       struct folio *folio;
>>>>       int nr_swap = 0;
>>>>       unsigned long next;
>>>> +    int nr, max_nr;
>>>>         next = pmd_addr_end(addr, end);
>>>>       if (pmd_trans_huge(*pmd))
>>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>           return 0;
>>>>       flush_tlb_batched_pending(mm);
>>>>       arch_enter_lazy_mmu_mode();
>>>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>>>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>>>> +        nr = 1;
>>>>           ptent = ptep_get(pte);
>>>>             if (pte_none(ptent))
>>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>                 entry = pte_to_swp_entry(ptent);
>>>>               if (!non_swap_entry(entry)) {
>>>> -                nr_swap--;
>>>> -                free_swap_and_cache(entry);
>>>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> +                max_nr = (end - addr) / PAGE_SIZE;
>>>> +                nr = swap_pte_batch(pte, max_nr, entry);
>>>> +                nr_swap -= nr;
>>>> +                free_swap_and_cache_nr(entry, nr);
>>>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>>               } else if (is_hwpoison_entry(entry) ||
>>>>                      is_poisoned_swp_entry(entry)) {
>>>>                   pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 7dc6c3d9fa83..ef2968894718 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>>>> *tlb,
>>>>                   folio_remove_rmap_pte(folio, page, vma);
>>>>               folio_put(folio);
>>>>           } else if (!non_swap_entry(entry)) {
>>>> -            /* Genuine swap entry, hence a private anon page */
>>>> +            max_nr = (end - addr) / PAGE_SIZE;
>>>> +            nr = swap_pte_batch(pte, max_nr, entry);
>>>> +            /* Genuine swap entries, hence private anon pages */
>>>>               if (!should_zap_cows(details))
>>>>                   continue;
>>>> -            rss[MM_SWAPENTS]--;
>>>> -            if (unlikely(!free_swap_and_cache(entry)))
>>>> -                print_bad_pte(vma, addr, ptent, NULL);
>>>> +            rss[MM_SWAPENTS] -= nr;
>>>> +            free_swap_and_cache_nr(entry, nr);
>>>>           } else if (is_migration_entry(entry)) {
>>>>               folio = pfn_swap_entry_folio(entry);
>>>>               if (!should_zap_folio(details, folio))
>>>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>               pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>>>               WARN_ON_ONCE(1);
>>>>           }
>>>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>>>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>
>>> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>>>
>>> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
>>> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>>>
>>> On mixture, you either lose some or place too many markers.
>>>
>>> A simple workaround would be to disable any such batching if the VMA does have
>>> uffd-wp enabled.
>>
>> Rather than this, I'll just consider all the swp pte bits when batching.
>>
>>>
>>>> +        zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>>>       } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>> [...]
>>
>
David Hildenbrand April 8, 2024, 3:13 p.m. UTC | #8
On 08.04.24 15:27, Ryan Roberts wrote:
> On 08/04/2024 13:47, Ryan Roberts wrote:
>> On 08/04/2024 13:07, Ryan Roberts wrote:
>>> [...]
>>>>
>>>> [...]
>>>>
>>>>> +
>>>>> +/**
>>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>>>> + * @start_ptep: Page table pointer for the first entry.
>>>>> + * @max_nr: The maximum number of table entries to consider.
>>>>> + * @entry: Swap entry recovered from the first table entry.
>>>>> + *
>>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>>>> + * containing swap entries all with consecutive offsets and targeting the same
>>>>> + * swap type.
>>>>> + *
>>>>
>>>> Likely you should document that any swp pte bits are ignored?
>>>
>>> Now that I understand what swp pte bits are, I think the simplest thing is to
>>> just make this function always consider the pte bits by using pte_same() as you
>>> suggest below? I don't think there is ever a case for ignoring the swp pte bits?
>>> And then I don't need to do anything special for uffd-wp either (below you
>>> suggested not doing batching when the VMA has uffd enabled).
>>>
>>> Any concerns?
>>>
>>>>
>>>>> + * max_nr must be at least one and must be limited by the caller so scanning
>>>>> + * cannot exceed a single page table.
>>>>> + *
>>>>> + * Return: the number of table entries in the batch.
>>>>> + */
>>>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>>>>> +                 swp_entry_t entry)
>>>>> +{
>>>>> +    const pte_t *end_ptep = start_ptep + max_nr;
>>>>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>>>>> +    unsigned int expected_type = swp_type(entry);
>>>>> +    pte_t *ptep = start_ptep + 1;
>>>>> +
>>>>> +    VM_WARN_ON(max_nr < 1);
>>>>> +    VM_WARN_ON(non_swap_entry(entry));
>>>>> +
>>>>> +    while (ptep < end_ptep) {
>>>>> +        pte_t pte = ptep_get(ptep);
>>>>> +
>>>>> +        if (pte_none(pte) || pte_present(pte))
>>>>> +            break;
>>>>> +
>>>>> +        entry = pte_to_swp_entry(pte);
>>>>> +
>>>>> +        if (non_swap_entry(entry) ||
>>>>> +            swp_type(entry) != expected_type ||
>>>>> +            swp_offset(entry) != expected_offset)
>>>>> +            break;
>>>>> +
>>>>> +        expected_offset++;
>>>>> +        ptep++;
>>>>> +    }
>>>>> +
>>>>> +    return ptep - start_ptep;
>>>>> +}
>>>>
>>>> Looks very clean :)
>>>>
>>>> I was wondering whether we could similarly construct the expected swp PTE and
>>>> only check pte_same.
>>>>
>>>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>>>
>>> So planning to do this.
>>
>> Of course this clears all the swp pte bits in expected_pte. So we need to do something a bit more complex.
>>
>> If we can safely assume all offset bits are contiguous in every per-arch representation then we can do:
> 
> Looks like at least csky and hexagon store the offset in discontiguous regions.
> So it will have to be the second approach if we want to avoid anything
> arch-specific. I'll assume that for now; we can always specialize
> pte_next_swp_offset() per-arch in the future if needed.

Sounds good. Just have a generic variant as you proposed, and add the 
per-arch one if really required later.
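
As a side note on the uffd-wp concern raised earlier in the thread, the sketch below (illustrative only; the helper name is made up) shows why comparing full pte values keeps zap_install_uffd_wp_if_needed() correct. It assumes the pte_same()-based swap_pte_batch() and pte_next_swp_offset() from the discussion above, and an architecture with CONFIG_HAVE_ARCH_USERFAULTFD_WP so that pte_swp_mkuffd_wp() sets a real software bit:

static inline bool swap_batch_stops_at_uffd_wp(unsigned int type,
					       unsigned long offset)
{
	pte_t first = __swp_entry_to_pte(__swp_entry(type, offset));
	pte_t second = pte_swp_mkuffd_wp(
			__swp_entry_to_pte(__swp_entry(type, offset + 1)));

	/*
	 * pte_next_swp_offset(first) keeps uffd-wp clear while 'second'
	 * has it set, so pte_same() fails and the batch ends after
	 * 'first'. Any run handed to zap_install_uffd_wp_if_needed()
	 * therefore has a uniform uffd-wp state.
	 */
	return !pte_same(second, pte_next_swp_offset(first));
}
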
diff mbox series

Patch

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a3fc8150b047..0278259f7078 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -708,6 +708,34 @@  static inline void pte_clear_not_present_full(struct mm_struct *mm,
 }
 #endif
 
+#ifndef clear_not_present_full_ptes
+/**
+ * clear_not_present_full_ptes - Clear consecutive not present PTEs.
+ * @mm: Address space the ptes represent.
+ * @addr: Address of the first pte.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over pte_clear_not_present_full().
+ *
+ * Context: The caller holds the page table lock.  The PTEs are all not present.
+ * The PTEs are all in the same PMD.
+ */
+static inline void clear_not_present_full_ptes(struct mm_struct *mm,
+		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+	for (;;) {
+		pte_clear_not_present_full(mm, addr, ptep, full);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
 			      unsigned long address,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f6f78198f000..5737236dc3ce 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -471,7 +471,7 @@  extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern int free_swap_and_cache(swp_entry_t);
+extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
@@ -520,8 +520,9 @@  static inline void put_swap_device(struct swap_info_struct *si)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
-/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
-#define free_swap_and_cache(e) is_pfn_swap_entry(e)
+static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+{
+}
 
 static inline void free_swap_cache(struct folio *folio)
 {
@@ -589,6 +590,11 @@  static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
+static inline void free_swap_and_cache(swp_entry_t entry)
+{
+	free_swap_and_cache_nr(entry, 1);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/internal.h b/mm/internal.h
index d567381b12cc..88705ab4c50a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,8 @@ 
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/tracepoint-defs.h>
 
 struct folio_batch;
@@ -189,6 +191,52 @@  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 
 	return min(ptep - start_ptep, max_nr);
 }
+
+/**
+ * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
+ * @start_ptep: Page table pointer for the first entry.
+ * @max_nr: The maximum number of table entries to consider.
+ * @entry: Swap entry recovered from the first table entry.
+ *
+ * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
+ * containing swap entries all with consecutive offsets and targeting the same
+ * swap type.
+ *
+ * max_nr must be at least one and must be limited by the caller so scanning
+ * cannot exceed a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
+				 swp_entry_t entry)
+{
+	const pte_t *end_ptep = start_ptep + max_nr;
+	unsigned long expected_offset = swp_offset(entry) + 1;
+	unsigned int expected_type = swp_type(entry);
+	pte_t *ptep = start_ptep + 1;
+
+	VM_WARN_ON(max_nr < 1);
+	VM_WARN_ON(non_swap_entry(entry));
+
+	while (ptep < end_ptep) {
+		pte_t pte = ptep_get(ptep);
+
+		if (pte_none(pte) || pte_present(pte))
+			break;
+
+		entry = pte_to_swp_entry(pte);
+
+		if (non_swap_entry(entry) ||
+		    swp_type(entry) != expected_type ||
+		    swp_offset(entry) != expected_offset)
+			break;
+
+		expected_offset++;
+		ptep++;
+	}
+
+	return ptep - start_ptep;
+}
 #endif /* CONFIG_MMU */
 
 void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
diff --git a/mm/madvise.c b/mm/madvise.c
index 1f77a51baaac..070bedb4996e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -628,6 +628,7 @@  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	struct folio *folio;
 	int nr_swap = 0;
 	unsigned long next;
+	int nr, max_nr;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -640,7 +641,8 @@  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
-	for (; addr != end; pte++, addr += PAGE_SIZE) {
+	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
+		nr = 1;
 		ptent = ptep_get(pte);
 
 		if (pte_none(ptent))
@@ -655,9 +657,11 @@  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 
 			entry = pte_to_swp_entry(ptent);
 			if (!non_swap_entry(entry)) {
-				nr_swap--;
-				free_swap_and_cache(entry);
-				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+				max_nr = (end - addr) / PAGE_SIZE;
+				nr = swap_pte_batch(pte, max_nr, entry);
+				nr_swap -= nr;
+				free_swap_and_cache_nr(entry, nr);
+				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
 			} else if (is_hwpoison_entry(entry) ||
 				   is_poisoned_swp_entry(entry)) {
 				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
diff --git a/mm/memory.c b/mm/memory.c
index 7dc6c3d9fa83..ef2968894718 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1637,12 +1637,13 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				folio_remove_rmap_pte(folio, page, vma);
 			folio_put(folio);
 		} else if (!non_swap_entry(entry)) {
-			/* Genuine swap entry, hence a private anon page */
+			max_nr = (end - addr) / PAGE_SIZE;
+			nr = swap_pte_batch(pte, max_nr, entry);
+			/* Genuine swap entries, hence private anon pages */
 			if (!should_zap_cows(details))
 				continue;
-			rss[MM_SWAPENTS]--;
-			if (unlikely(!free_swap_and_cache(entry)))
-				print_bad_pte(vma, addr, ptent, NULL);
+			rss[MM_SWAPENTS] -= nr;
+			free_swap_and_cache_nr(entry, nr);
 		} else if (is_migration_entry(entry)) {
 			folio = pfn_swap_entry_folio(entry);
 			if (!should_zap_folio(details, folio))
@@ -1665,8 +1666,8 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
 			WARN_ON_ONCE(1);
 		}
-		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
+		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
 	add_mm_rss_vec(mm, rss);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0d44ee2b4f9c..d059de6896c1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -130,7 +130,11 @@  static inline unsigned char swap_count(unsigned char ent)
 /* Reclaim the swap entry if swap is getting full*/
 #define TTRS_FULL		0x4
 
-/* returns 1 if swap entry is freed */
+/*
+ * returns number of pages in the folio that backs the swap entry. If positive,
+ * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
+ * folio was associated with the swap entry.
+ */
 static int __try_to_reclaim_swap(struct swap_info_struct *si,
 				 unsigned long offset, unsigned long flags)
 {
@@ -155,6 +159,7 @@  static int __try_to_reclaim_swap(struct swap_info_struct *si,
 			ret = folio_free_swap(folio);
 		folio_unlock(folio);
 	}
+	ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
 	folio_put(folio);
 	return ret;
 }
@@ -895,7 +900,7 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
 		spin_lock(&si->lock);
 		/* entry was freed successfully, try to use this again */
-		if (swap_was_freed)
+		if (swap_was_freed > 0)
 			goto checks;
 		goto scan; /* check next one */
 	}
@@ -1572,32 +1577,75 @@  bool folio_free_swap(struct folio *folio)
 	return true;
 }
 
-/*
- * Free the swap entry like above, but also try to
- * free the page cache entry if it is the last user.
- */
-int free_swap_and_cache(swp_entry_t entry)
+void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 {
-	struct swap_info_struct *p;
+	unsigned long end = swp_offset(entry) + nr;
+	unsigned int type = swp_type(entry);
+	struct swap_info_struct *si;
+	bool any_only_cache = false;
+	unsigned long offset;
 	unsigned char count;
 
 	if (non_swap_entry(entry))
-		return 1;
+		return;
 
-	p = get_swap_device(entry);
-	if (p) {
-		if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
-			put_swap_device(p);
-			return 0;
+	si = get_swap_device(entry);
+	if (!si)
+		return;
+
+	if (WARN_ON(end > si->max))
+		goto out;
+
+	/*
+	 * First free all entries in the range.
+	 */
+	for (offset = swp_offset(entry); offset < end; offset++) {
+		if (!WARN_ON(data_race(!si->swap_map[offset]))) {
+			count = __swap_entry_free(si, swp_entry(type, offset));
+			if (count == SWAP_HAS_CACHE)
+				any_only_cache = true;
 		}
+	}
+
+	/*
+	 * Short-circuit the below loop if none of the entries had their
+	 * reference drop to zero.
+	 */
+	if (!any_only_cache)
+		goto out;
 
-		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE)
-			__try_to_reclaim_swap(p, swp_offset(entry),
+	/*
+	 * Now go back over the range trying to reclaim the swap cache. This is
+	 * more efficient for large folios because we will only try to reclaim
+	 * the swap once per folio in the common case. If we do
+	 * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
+	 * latter will get a reference and lock the folio for every individual
+	 * page but will only succeed once the swap slot for every subpage is
+	 * zero.
+	 */
+	for (offset = swp_offset(entry); offset < end; offset += nr) {
+		nr = 1;
+		if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
+			/*
+			 * Folios are always naturally aligned in swap so
+			 * advance forward to the next boundary. Zero means no
+			 * folio was found for the swap entry, so advance by 1
+			 * in this case. Negative value means folio was found
+			 * but could not be reclaimed. Here we can still advance
+			 * to the next boundary.
+			 */
+			nr = __try_to_reclaim_swap(si, offset,
 					      TTRS_UNMAPPED | TTRS_FULL);
-		put_swap_device(p);
+			if (nr == 0)
+				nr = 1;
+			else if (nr < 0)
+				nr = -nr;
+			nr = ALIGN(offset + 1, nr) - offset;
+		}
 	}
-	return p != NULL;
+
+out:
+	put_swap_device(si);
 }
 
 #ifdef CONFIG_HIBERNATION