
[RFC,v2,1/2] mm: Don't skip swap entry even if zap_details specified

Message ID 20211115134951.85286-2-peterx@redhat.com (mailing list archive)
State New
Series mm: Rework zap ptes on swap entries

Commit Message

Peter Xu Nov. 15, 2021, 1:49 p.m. UTC
This check has existed since the first git commit of the Linux repository, but at
that time there was no page migration yet, so I think it was okay.

With page migration enabled, it should logically be possible that we zap some
shmem pages during migration.  When that happens, IIUC the old code could have
the MM_SHMEMPAGES RSS counter accounted wrongly, because we zap the ptes
without decreasing the counters for the migrating entries.  I have no unit test
to prove it, though, as I don't know an easy way to trigger this condition.

Besides, the optimization itself is already confusing to me in a few ways:

  - The wording "skip swap entries" is confusing, because we're not skipping all
    swap entries - we handle device private/exclusive pages before that.

  - The skip behavior is enabled whenever a zap_details pointer is passed in.
    That is very hard for a new zap caller to figure out, because it's unclear
    why we should skip swap entries when zap_details is specified.

  - On modern systems, especially for performance-critical use cases, swap
    entries should be rare, so I doubt the usefulness of this optimization
    since it is on a slow path anyway.

  - It is not aligned with what we do with huge pmd swap entries, where in
    zap_huge_pmd() we'll do the accounting unconditionally.

This patch drops that trick so that we handle swap ptes coherently.  Meanwhile,
we apply the same mapping check to migration entries too.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
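
For reference, a minimal sketch of the intended swap-pte handling after this patch
(the actual diff is quoted in the comments below; zap_skip_check_mapping() is the
helper name in the tree this patch is based on, which reviewers later suggest
renaming):

	/*
	 * Sketch only, not the verbatim patch: instead of skipping every
	 * non-present pte whenever zap_details is set, only the per-mapping
	 * check is applied, and the RSS accounting always runs.
	 */
	if (!non_swap_entry(entry)) {
		/* A real swap entry for anonymous memory */
		rss[MM_SWAPENTS]--;
	} else if (is_migration_entry(entry)) {
		struct page *page = pfn_swap_entry_to_page(entry);

		/* Apply the same mapping check as for present pages */
		if (zap_skip_check_mapping(details, page))
			continue;
		rss[mm_counter(page)]--;
	}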

Comments

Alistair Popple Dec. 2, 2021, 11:06 a.m. UTC | #1
On Tuesday, 16 November 2021 12:49:50 AM AEDT Peter Xu wrote:
> This check existed since the 1st git commit of Linux repository, but at that
> time there's no page migration yet so I think it's okay.
> 
> With page migration enabled, it should logically be possible that we zap some
> shmem pages during migration.  When that happens, IIUC the old code could have
> the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> without decreasing the counters for the migrating entries.  I have no unit test
> to prove it as I don't know an easy way to trigger this condition, though.
> 
> Besides, the optimization itself is already confusing IMHO to me in a few points:

I've spent a bit of time looking at this, and I think it would be good to get it
cleaned up as I've found it hard to follow in the past. What I haven't been
able to confirm is whether anything relies on skipping swap entries or not. From
your description it sounds like skipping swap entries was done as an
optimisation rather than for some functional reason. Is that correct?

>   - The wording "skip swap entries" is confusing, because we're not skipping all
>     swap entries - we handle device private/exclusive pages before that.
> 
>   - The skip behavior is enabled as long as zap_details pointer passed over.
>     It's very hard to figure that out for a new zap caller because it's unclear
>     why we should skip swap entries when we have zap_details specified.
> 
>   - With modern systems, especially performance critical use cases, swap
>     entries should be rare, so I doubt the usefulness of this optimization
>     since it should be on a slow path anyway.
> 
>   - It is not aligned with what we do with huge pmd swap entries, where in
>     zap_huge_pmd() we'll do the accounting unconditionally.
> 
> This patch drops that trick, so we handle swap ptes coherently.  Meanwhile we
> should do the same mapping check upon migration entries too.

I agree, and I'm not convinced the current handling is very good - if we
skip zapping a migration entry then the page mapping might get restored when
the migration entry is removed.

In practice I don't think that is a problem as the migration entry target page
will be locked, and if I'm understanding things correctly callers of
unmap_mapping_*() need to have the page(s) locked anyway if they want to be
sure the page is unmapped. But it seems removing the migration entries better
matches the intent and I can't think of a reason why they should be skipped.

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 8f1de811a1dc..e454f3c6aeb9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			continue;
>  		}
>  
> -		/* If details->check_mapping, we leave swap entries. */
> -		if (unlikely(details))
> -			continue;
> -
>  		if (!non_swap_entry(entry))
>  			rss[MM_SWAPENTS]--;
>  		else if (is_migration_entry(entry)) {
>  			struct page *page;
>  
>  			page = pfn_swap_entry_to_page(entry);
> +			if (unlikely(zap_skip_check_mapping(details, page)))
> +				continue;
>  			rss[mm_counter(page)]--;
>  		}
>  		if (unlikely(!free_swap_and_cache(entry)))
>
Peter Xu Dec. 3, 2021, 3:21 a.m. UTC | #2
On Thu, Dec 02, 2021 at 10:06:46PM +1100, Alistair Popple wrote:
> On Tuesday, 16 November 2021 12:49:50 AM AEDT Peter Xu wrote:
> > This check existed since the 1st git commit of Linux repository, but at that
> > time there's no page migration yet so I think it's okay.
> > 
> > With page migration enabled, it should logically be possible that we zap some
> > shmem pages during migration.  When that happens, IIUC the old code could have
> > the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> > without decreasing the counters for the migrating entries.  I have no unit test
> > to prove it as I don't know an easy way to trigger this condition, though.
> > 
> > Besides, the optimization itself is already confusing IMHO to me in a few points:
> 
> I've spent a bit of time looking at this and think it would be good to get
> cleaned up as I've found it hard to follow in the past. What I haven't been
> able to confirm is if anything relies on skipping swap entries or not. From
> you're description it sounds like skipping swap entries was done as an
> optimisation rather than for some functional reason is that correct?

Thanks again for looking into this patch, Alistair.  I appreciate it a lot.

I should say that this is just how I understand it, and I could be wrong; that's
the major reason why I marked this patch as RFC.

As I mentioned, this behavior has existed since the 1st commit of Linux's git
history, at a time when there were no special swap entries at all and all the
swap entries were "real" swap entries for anonymous memory.

That's why I think it was meant as an optimization: previously, when zap_details
(along with zap_details->mapping in the old code) was non-null, the target was
definitely not an anonymous page, so skipping swap entries for file-backed
memory sounded like a good optimization.

However, since then all kinds of swap entries have been introduced, and as you
spotted, at least migration entries can exist for some file-backed memory
types (shmem).

> 
> >   - The wording "skip swap entries" is confusing, because we're not skipping all
> >     swap entries - we handle device private/exclusive pages before that.
> > 
> >   - The skip behavior is enabled as long as zap_details pointer passed over.
> >     It's very hard to figure that out for a new zap caller because it's unclear
> >     why we should skip swap entries when we have zap_details specified.
> > 
> >   - With modern systems, especially performance critical use cases, swap
> >     entries should be rare, so I doubt the usefulness of this optimization
> >     since it should be on a slow path anyway.
> > 
> >   - It is not aligned with what we do with huge pmd swap entries, where in
> >     zap_huge_pmd() we'll do the accounting unconditionally.
> > 
> > This patch drops that trick, so we handle swap ptes coherently.  Meanwhile we
> > should do the same mapping check upon migration entries too.
> 
> I agree, and I'm not convinced the current handling is very good - if we
> skip zapping a migration entry then the page mapping might get restored when
> the migration entry is removed.
> 
> In practice I don't think that is a problem as the migration entry target page
> will be locked, and if I'm understanding things correctly callers of
> unmap_mapping_*() need to have the page(s) locked anyway if they want to be
> sure the page is unmapped. But it seems removing the migration entries better
> matches the intent and I can't think of a reason why they should be skipped.

Exactly, that's how I see it too.

I used to think there was a bug for shmem migration (if you still remember, I
mentioned it in some of my previous patchset cover letters), but then I found
that migration requires the page lock, so it's probably not a real bug at all.
However, that's still not a convincing reason to ignore swap entries.

I wanted to "ignore" this problem with the "add a flag to skip swap entries"
patch, but as you saw it was not very welcome, so I had no choice but to try to
find the fundamental reason for skipping swap entries.  When I found I couldn't
really identify any good reason, and the skipping even seems buggy, I wrote
this patch.  If this is the right way, the zap pte path can be simplified quite
a lot after patch 2 of this series.
Alistair Popple Dec. 3, 2021, 5:33 a.m. UTC | #3
On Friday, 3 December 2021 2:21:41 PM AEDT Peter Xu wrote:
> On Thu, Dec 02, 2021 at 10:06:46PM +1100, Alistair Popple wrote:
> > On Tuesday, 16 November 2021 12:49:50 AM AEDT Peter Xu wrote:
> > > This check existed since the 1st git commit of Linux repository, but at that
> > > time there's no page migration yet so I think it's okay.
> > > 
> > > With page migration enabled, it should logically be possible that we zap some
> > > shmem pages during migration.  When that happens, IIUC the old code could have
> > > the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> > > without decreasing the counters for the migrating entries.  I have no unit test
> > > to prove it as I don't know an easy way to trigger this condition, though.
> > > 
> > > Besides, the optimization itself is already confusing IMHO to me in a few points:
> > 
> > I've spent a bit of time looking at this and think it would be good to get
> > cleaned up as I've found it hard to follow in the past. What I haven't been
> > able to confirm is if anything relies on skipping swap entries or not. From
> > you're description it sounds like skipping swap entries was done as an
> > optimisation rather than for some functional reason is that correct?
> 
> Thanks again for looking into this patch, Alistair.  I appreciate it a lot.
> 
> I should say that it's how I understand this, and I could be wrong, that's the

That makes two of us!

> major reason why I marked this patch as RFC.
>
> As I mentioned this behavior existed in the 1st commit of git history of Linux,
> that's the time when there's no special swap entries at all but all the swap
> entries are "real" swap entries for anonymous.
> 
> That's why I think it should be an optimization because when previously
> zap_details (along with zap_details->mapping in the old code) is non-null, and
> that's definitely not an anonymous page.  Then skipping swap entry for file
> backed memory sounds like a good optimization.

Thanks. That was the detail I was trying to figure out, i.e. why something might
want to skip swap entries. I will spend some more time looking to be sure
though.

> However after that we've got all kinds of swap entries introduced, and as you
> spotted at least the migration entry should be able to exist to some file
> backed memory type (shmem).
> 
> > 
> > >   - The wording "skip swap entries" is confusing, because we're not skipping all
> > >     swap entries - we handle device private/exclusive pages before that.
> > > 
> > >   - The skip behavior is enabled as long as zap_details pointer passed over.
> > >     It's very hard to figure that out for a new zap caller because it's unclear
> > >     why we should skip swap entries when we have zap_details specified.
> > > 
> > >   - With modern systems, especially performance critical use cases, swap
> > >     entries should be rare, so I doubt the usefulness of this optimization
> > >     since it should be on a slow path anyway.
> > > 
> > >   - It is not aligned with what we do with huge pmd swap entries, where in
> > >     zap_huge_pmd() we'll do the accounting unconditionally.
> > > 
> > > This patch drops that trick, so we handle swap ptes coherently.  Meanwhile we
> > > should do the same mapping check upon migration entries too.
> > 
> > I agree, and I'm not convinced the current handling is very good - if we
> > skip zapping a migration entry then the page mapping might get restored when
> > the migration entry is removed.
> > 
> > In practice I don't think that is a problem as the migration entry target page
> > will be locked, and if I'm understanding things correctly callers of
> > unmap_mapping_*() need to have the page(s) locked anyway if they want to be
> > sure the page is unmapped. But it seems removing the migration entries better
> > matches the intent and I can't think of a reason why they should be skipped.
> 
> Exactly, that's what I see this too.
> 
> I used to think there is a bug for shmem migration (if you still remember I
> mentioned it in some of my previous patchset cover letters), but then I found
> migration requires page lock then it's probably not a real bug at all.  However
> that's never a convincing reason to ignore swap entries.

Right, it also took me a while to convince myself there wasn't a bug there, so
if for some reason this patch doesn't end up going in I think we should still
treat migration entries the same way as device-private entries.

> I wanted to "ignore" this problem by the "adding a flag to skip swap entry"
> patch, but as you saw it was very not welcomed anyway, so I have no choice to
> try find the fundamental reason for skipping swap entries.  When I figured I
> cannot really find any good reason and skipping seems to be even buggy, hence
> this patch.  If this is the right way, the zap pte path can be simplified quite
> a lot after patch 2 of this series.

Yep, I think it's definitely worth trying to figure out. And if it turns out
there is some good reason for skipping, we had better make sure to document it
in a comment somewhere so none of this good research is lost. However, I haven't
yet come up with a reason why they need to be skipped either.

 - Alistair
Peter Xu Dec. 3, 2021, 6:59 a.m. UTC | #4
On Fri, Dec 03, 2021 at 04:33:43PM +1100, Alistair Popple wrote:
> Thanks. That was the detail I was trying to figure out. Ie. why might something
> want to skip swap entries. I will spend some more time looking to be sure
> though.

Sure, please take your time, and thanks again for working together on this.
Hugh Dickins Jan. 9, 2022, 1:19 a.m. UTC | #5
On Mon, 15 Nov 2021, Peter Xu wrote:

> This check existed since the 1st git commit of Linux repository, but at that
> time there's no page migration yet so I think it's okay.

//git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
helps with the history.  Yes, the check was okay back then,
but a lot of changes have come in since: I'll tell more of those below.

You are absolutely right to clean this up and fix the bugs that
have crept in, but I think the patch itself is not quite right yet.

> 
> With page migration enabled, it should logically be possible that we zap some
> shmem pages during migration.

Yes.

> When that happens, IIUC the old code could have
> the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> without decreasing the counters for the migrating entries.  I have no unit test
> to prove it as I don't know an easy way to trigger this condition, though.

In the no-details case, yes, it does look like that. I ought to try
and reproduce that, but responding to mails seems more important.

> 
> Besides, the optimization itself is already confusing IMHO to me in a few points:

It is confusing and unnecessary and wrong, I agree.

> 
>   - The wording "skip swap entries" is confusing, because we're not skipping all
>     swap entries - we handle device private/exclusive pages before that.

I'm entirely ignorant of device pages, so cannot comment on your 2/2,
but of course it's good if you can bring the cases closer together.

> 
>   - The skip behavior is enabled as long as zap_details pointer passed over.
>     It's very hard to figure that out for a new zap caller because it's unclear
>     why we should skip swap entries when we have zap_details specified.

History below will clarify that.

> 
>   - With modern systems, especially performance critical use cases, swap
>     entries should be rare, so I doubt the usefulness of this optimization
>     since it should be on a slow path anyway.
> 
>   - It is not aligned with what we do with huge pmd swap entries, where in
>     zap_huge_pmd() we'll do the accounting unconditionally.

The patch below does not align with what's done in zap_huge_pmd() either;
but I think zap_huge_pmd() is okay without "details" because its only swap
entries are migration entries, and we do not use huge pages when COWing
from file huge pages.

> 
> This patch drops that trick, so we handle swap ptes coherently.  Meanwhile we
> should do the same mapping check upon migration entries too.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 8f1de811a1dc..e454f3c6aeb9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			continue;
>  		}
>  
> -		/* If details->check_mapping, we leave swap entries. */
> -		if (unlikely(details))
> -			continue;
> -

Good.

>  		if (!non_swap_entry(entry))
>  			rss[MM_SWAPENTS]--;
>  		else if (is_migration_entry(entry)) {
>  			struct page *page;
>  
>  			page = pfn_swap_entry_to_page(entry);
> +			if (unlikely(zap_skip_check_mapping(details, page)))
> +				continue;

Good; though I don't think it's worth an "unlikely", and I say again,
more forcefully, that "zap_skip_check_mapping" is a terrible name.

David suggested naming its inversion should_zap_page(), yes, that's
much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
but should_zap_page() for the inversion sounds good to me.

And I'm pleased to see in one of Matthew's diffs that he intends to
bring zap_details and zap_skip_check_mapping() back into mm/memory.c
from include/linux/mm.h.

>  			rss[mm_counter(page)]--;
>  		}
>  		if (unlikely(!free_swap_and_cache(entry)))
> -- 
> 2.32.0

History.  zap_details came in 2.6.6, and it was mostly to manage
truncation on the non-linear vmas we had at that time (remap_file_pages
syscall packing non-sequential offsets into a single vma, with pte_file
entries), where index could not be deduced from position of pte in vma:
truncation range had to be passed down in zap_details; and an madvise
needed it too, so it could not be private to mm/memory.c then.

But at the same time, I added the even_cows argument to
unmap_mapping_range(), to distinguish truncation (which POSIX requires
to unmap even COWed pages) from invalidation (for page cache coherence,
which shouldn't touch private COWs).  However, there appear to be no
users of that in 2.6.6, though I wouldn't have added that complication
just for the fun of confusing everyone: best guess would be that there
was parallel development, and the use for !even_cows got removed in
the very same release that it was being added.

(PageAnon was brand new in 2.6.6: maybe it could have been used instead
of comparing details->check_mapping, or maybe there's some other case
I've forgotten which actually needs the exact mapping check.)

Eventually a few drivers came to use unmap_shared_mapping_range(),
the !even_cows caller; but more to the point, hole-punching came in,
and I felt very strongly that hole-punching on a file had no right
to delete private user data.  So then !even_cows became useful.

IIRC, I've seen Linus say he *detests* zap_details.  It had much better
justification in the days of non-linear, and I was sad to add
single_page to it quite recently; but hope that can go away later
(when try_to_unmap_one() is properly extended to THPs).

Now, here's what may clear up a lot of the confusion.
In 2.6.7, zap_pte_range() got a line at its head
	if (details && !details->check_mapping && !details->nonlinear_vma)
		details = NULL;
which paired with the
		/*
		 * If details->check_mapping, we leave swap entries;
		 * if details->nonlinear_vma, we leave file entries.
		 */
		if (unlikely(details))
			continue;
lower down.  I haven't followed up the actual commits, but ChangeLog-2.6.7
implies that 2.6.6 had a "details = NULL;" placed elsewhere but buggily.
In 2.6.12 it moved from zap_pte_range() to unmap_page_range().
It was intended, not so much to optimize, as to simplify the flow;
but in fact has caused all this confusion.

When Kirill discontinued non-linear mapping support in 4.0 (no tears
were shed and) the nonlinear_vma comment above got deleted, leaving
just the then more puzzling check_mapping comment.

Then in 4.6 the "details = NULL;" seems to have got deleted as part of
aac453635549 ("mm, oom: introduce oom reaper"), which added some new
details (ignore_dirty and check_swap_entries); which got reverted in
4.11, but apparently the "details = NULL;" was not restored.

My stamina is suddenly exhausted, without actually pinpointing a commit
for "Fixes:" in your eventual cleanup.  Sorry, I've had enough!

Hugh
David Hildenbrand Jan. 10, 2022, 8:37 a.m. UTC | #6
On 15.11.21 14:49, Peter Xu wrote:
> This check existed since the 1st git commit of Linux repository, but at that
> time there's no page migration yet so I think it's okay.
> 
> With page migration enabled, it should logically be possible that we zap some
> shmem pages during migration.  When that happens, IIUC the old code could have
> the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> without decreasing the counters for the migrating entries.  I have no unit test
> to prove it as I don't know an easy way to trigger this condition, though.
> 
> Besides, the optimization itself is already confusing IMHO to me in a few points:
> 
>   - The wording "skip swap entries" is confusing, because we're not skipping all
>     swap entries - we handle device private/exclusive pages before that.

I think one part of the confusion is "swap vs non-swap" entries.
For !pte_none() && !pte_present() we can have

* swap entry
* non-swap entry
** device exclusive entry
** device private entry
** HWpoison entry
** migration entry

So the comment claims to skip "swap entries" but also skips HWpoison and
migration entries, and I think that's the confusing part.
Both only apply to PageAnon().
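
For reference, these cases are usually told apart with the swapops.h helpers,
roughly like this (a sketch using helpers I believe exist in current kernels,
not code from this patch):

	swp_entry_t entry = pte_to_swp_entry(ptent);

	if (!non_swap_entry(entry)) {
		/* A "real" swap entry: anonymous memory written out to swap */
	} else if (is_device_private_entry(entry) ||
		   is_device_exclusive_entry(entry)) {
		/* Device private/exclusive entry, handled earlier in zap_pte_range() */
	} else if (is_migration_entry(entry)) {
		/* Page under migration; pfn_swap_entry_to_page() gives the target */
	} else if (is_hwpoison_entry(entry)) {
		/* The backing page was hardware-poisoned */
	}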


IIUC, the only way we could get details != NULL is via unmap_mapping_page()+unmap_mapping_pages().

I do wonder if any of the callers really care about PageAnon() pages where this would be relevant.

Am I wrong, or is unmap_mapping_pages() never called with "even_cows == true", meaning we could remove
that parameter:


git grep -C2 unmap_mapping_pages
fs/afs/callback.c-      struct afs_vnode *vnode = container_of(work, struct afs_vnode, cb_work);
fs/afs/callback.c-
fs/afs/callback.c:      unmap_mapping_pages(vnode->vfs_inode.i_mapping, 0, 0, false);
fs/afs/callback.c-}
fs/afs/callback.c-
--
fs/dax.c-               if (dax_is_zero_entry(entry)) {
fs/dax.c-                       xas_unlock_irq(xas);
fs/dax.c:                       unmap_mapping_pages(mapping,
fs/dax.c-                                       xas->xa_index & ~PG_PMD_COLOUR,
fs/dax.c-                                       PG_PMD_NR, false);
--
fs/dax.c-        * get_user_pages() slow path.  The slow path is protected by
fs/dax.c-        * pte_lock() and pmd_lock(). New references are not taken without
fs/dax.c:        * holding those locks, and unmap_mapping_pages() will not zero the
fs/dax.c-        * pte or pmd without holding the respective lock, so we are
fs/dax.c-        * guaranteed to either see new references or prevent new
fs/dax.c-        * references from being established.
fs/dax.c-        */
fs/dax.c:       unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
fs/dax.c-
fs/dax.c-       xas_lock_irq(&xas);
--
fs/dax.c-               /* we are replacing a zero page with block mapping */
fs/dax.c-               if (dax_is_pmd_entry(entry))
fs/dax.c:                       unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
fs/dax.c-                                       PG_PMD_NR, false);
fs/dax.c-               else /* pte entry */
fs/dax.c:                       unmap_mapping_pages(mapping, index, 1, false);
fs/dax.c-       }
fs/dax.c-
--
include/linux/mm.h-                         bool *unlocked);
include/linux/mm.h-void unmap_mapping_page(struct page *page);
include/linux/mm.h:void unmap_mapping_pages(struct address_space *mapping,
include/linux/mm.h-             pgoff_t start, pgoff_t nr, bool even_cows);
include/linux/mm.h-void unmap_mapping_range(struct address_space *mapping,
--
include/linux/mm.h-}
include/linux/mm.h-static inline void unmap_mapping_page(struct page *page) { }
include/linux/mm.h:static inline void unmap_mapping_pages(struct address_space *mapping,
include/linux/mm.h-             pgoff_t start, pgoff_t nr, bool even_cows) { }
include/linux/mm.h-static inline void unmap_mapping_range(struct address_space *mapping,
--
mm/khugepaged.c-
mm/khugepaged.c-                if (page_mapped(page))
mm/khugepaged.c:                        unmap_mapping_pages(mapping, index, 1, false);
mm/khugepaged.c-
mm/khugepaged.c-                xas_lock_irq(&xas);
--
mm/memory.c- * Unmap this page from any userspace process which still has it mmaped.
mm/memory.c- * Typically, for efficiency, the range of nearby pages has already been
mm/memory.c: * unmapped by unmap_mapping_pages() or unmap_mapping_range().  But once
mm/memory.c- * truncation or invalidation holds the lock on a page, it may find that
mm/memory.c- * the page has been remapped again: and then uses unmap_mapping_page()
--
mm/memory.c-
mm/memory.c-/**
mm/memory.c: * unmap_mapping_pages() - Unmap pages from processes.
mm/memory.c- * @mapping: The address space containing pages to be unmapped.
mm/memory.c- * @start: Index of first page to be unmapped.
--
mm/memory.c- * cache.
mm/memory.c- */
mm/memory.c:void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
mm/memory.c-            pgoff_t nr, bool even_cows)
mm/memory.c-{
--
mm/memory.c-    i_mmap_unlock_write(mapping);
mm/memory.c-}
mm/memory.c:EXPORT_SYMBOL_GPL(unmap_mapping_pages);
mm/memory.c-
mm/memory.c-/**
--
mm/memory.c-    }
mm/memory.c-
mm/memory.c:    unmap_mapping_pages(mapping, hba, hlen, even_cows);
mm/memory.c-}
mm/memory.c-EXPORT_SYMBOL(unmap_mapping_range);
--
mm/truncate.c-                           * zap the rest of the file in one hit.
mm/truncate.c-                           */
mm/truncate.c:                          unmap_mapping_pages(mapping, index,
mm/truncate.c-                                          (1 + end - index), false);
mm/truncate.c-                          did_range_unmap = 1;
--
mm/truncate.c-   */
mm/truncate.c-  if (dax_mapping(mapping)) {
mm/truncate.c:          unmap_mapping_pages(mapping, start, end - start + 1, false);
mm/truncate.c-  }
mm/truncate.c-out:
Alistair Popple Jan. 11, 2022, 7:40 a.m. UTC | #7
On Monday, 10 January 2022 7:37:15 PM AEDT David Hildenbrand wrote:
> On 15.11.21 14:49, Peter Xu wrote:
> > This check existed since the 1st git commit of Linux repository, but at that
> > time there's no page migration yet so I think it's okay.
> > 
> > With page migration enabled, it should logically be possible that we zap some
> > shmem pages during migration.  When that happens, IIUC the old code could have
> > the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> > without decreasing the counters for the migrating entries.  I have no unit test
> > to prove it as I don't know an easy way to trigger this condition, though.
> > 
> > Besides, the optimization itself is already confusing IMHO to me in a few points:
> > 
> >   - The wording "skip swap entries" is confusing, because we're not skipping all
> >     swap entries - we handle device private/exclusive pages before that.
> 
> I think one part of the confusion is "swap vs non-swap" entries.
> For !pte_none() && !pte_present() we can have
> 
> * swap entry
> * non-swap entry
> ** device exclusive entry
> ** device private entry
> ** HWpoison entry
> ** migration entry
> 
> So the comment claims to skip "swap entries" but also skips HWpoison and
> migration entries, and I think that's the confusing part.
> Both only apply to PageAnon().

I must be missing something but why do these only apply to PageAnon()?

> IIUC, the only way we could get details != NULL is via unmap_mapping_page()+unmap_mapping_pages().
> 
> I do wonder if any of the callers really cares about PageAnon() pages where this would be relevant.
> 
> Am I wrong or is unmap_mapping_pages() never called with "even_cows == true" and we can remove
> that paremeter:

Except that unmap_mapping_range() takes `even_cows` as a parameter and passes
that to unmap_mapping_pages(), and from what I can tell there are callers of
unmap_mapping_range() that set `even_cows = true`.
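
For reference, a rough sketch of that path (simplified from my reading of
mm/memory.c and mm/truncate.c, not verbatim):

	/* unmap_mapping_range() simply forwards even_cows: */
	void unmap_mapping_range(struct address_space *mapping,
			loff_t const holebegin, loff_t const holelen, int even_cows)
	{
		pgoff_t hba = holebegin >> PAGE_SHIFT;
		pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

		/* ... overflow check elided ... */
		unmap_mapping_pages(mapping, hba, hlen, even_cows);
	}

	/* and truncation passes even_cows == 1, e.g. I believe truncate_pagecache() does: */
	unmap_mapping_range(mapping, holebegin, 0, 1);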

> git grep -C2 unmap_mapping_pages
> fs/afs/callback.c-      struct afs_vnode *vnode = container_of(work, struct afs_vnode, cb_work);
> fs/afs/callback.c-
> fs/afs/callback.c:      unmap_mapping_pages(vnode->vfs_inode.i_mapping, 0, 0, false);
> fs/afs/callback.c-}
> fs/afs/callback.c-
> --
> fs/dax.c-               if (dax_is_zero_entry(entry)) {
> fs/dax.c-                       xas_unlock_irq(xas);
> fs/dax.c:                       unmap_mapping_pages(mapping,
> fs/dax.c-                                       xas->xa_index & ~PG_PMD_COLOUR,
> fs/dax.c-                                       PG_PMD_NR, false);
> --
> fs/dax.c-        * get_user_pages() slow path.  The slow path is protected by
> fs/dax.c-        * pte_lock() and pmd_lock(). New references are not taken without
> fs/dax.c:        * holding those locks, and unmap_mapping_pages() will not zero the
> fs/dax.c-        * pte or pmd without holding the respective lock, so we are
> fs/dax.c-        * guaranteed to either see new references or prevent new
> fs/dax.c-        * references from being established.
> fs/dax.c-        */
> fs/dax.c:       unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
> fs/dax.c-
> fs/dax.c-       xas_lock_irq(&xas);
> --
> fs/dax.c-               /* we are replacing a zero page with block mapping */
> fs/dax.c-               if (dax_is_pmd_entry(entry))
> fs/dax.c:                       unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
> fs/dax.c-                                       PG_PMD_NR, false);
> fs/dax.c-               else /* pte entry */
> fs/dax.c:                       unmap_mapping_pages(mapping, index, 1, false);
> fs/dax.c-       }
> fs/dax.c-
> --
> include/linux/mm.h-                         bool *unlocked);
> include/linux/mm.h-void unmap_mapping_page(struct page *page);
> include/linux/mm.h:void unmap_mapping_pages(struct address_space *mapping,
> include/linux/mm.h-             pgoff_t start, pgoff_t nr, bool even_cows);
> include/linux/mm.h-void unmap_mapping_range(struct address_space *mapping,
> --
> include/linux/mm.h-}
> include/linux/mm.h-static inline void unmap_mapping_page(struct page *page) { }
> include/linux/mm.h:static inline void unmap_mapping_pages(struct address_space *mapping,
> include/linux/mm.h-             pgoff_t start, pgoff_t nr, bool even_cows) { }
> include/linux/mm.h-static inline void unmap_mapping_range(struct address_space *mapping,
> --
> mm/khugepaged.c-
> mm/khugepaged.c-                if (page_mapped(page))
> mm/khugepaged.c:                        unmap_mapping_pages(mapping, index, 1, false);
> mm/khugepaged.c-
> mm/khugepaged.c-                xas_lock_irq(&xas);
> --
> mm/memory.c- * Unmap this page from any userspace process which still has it mmaped.
> mm/memory.c- * Typically, for efficiency, the range of nearby pages has already been
> mm/memory.c: * unmapped by unmap_mapping_pages() or unmap_mapping_range().  But once
> mm/memory.c- * truncation or invalidation holds the lock on a page, it may find that
> mm/memory.c- * the page has been remapped again: and then uses unmap_mapping_page()
> --
> mm/memory.c-
> mm/memory.c-/**
> mm/memory.c: * unmap_mapping_pages() - Unmap pages from processes.
> mm/memory.c- * @mapping: The address space containing pages to be unmapped.
> mm/memory.c- * @start: Index of first page to be unmapped.
> --
> mm/memory.c- * cache.
> mm/memory.c- */
> mm/memory.c:void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
> mm/memory.c-            pgoff_t nr, bool even_cows)
> mm/memory.c-{
> --
> mm/memory.c-    i_mmap_unlock_write(mapping);
> mm/memory.c-}
> mm/memory.c:EXPORT_SYMBOL_GPL(unmap_mapping_pages);
> mm/memory.c-
> mm/memory.c-/**
> --
> mm/memory.c-    }
> mm/memory.c-
> mm/memory.c:    unmap_mapping_pages(mapping, hba, hlen, even_cows);
> mm/memory.c-}
> mm/memory.c-EXPORT_SYMBOL(unmap_mapping_range);
> --
> mm/truncate.c-                           * zap the rest of the file in one hit.
> mm/truncate.c-                           */
> mm/truncate.c:                          unmap_mapping_pages(mapping, index,
> mm/truncate.c-                                          (1 + end - index), false);
> mm/truncate.c-                          did_range_unmap = 1;
> --
> mm/truncate.c-   */
> mm/truncate.c-  if (dax_mapping(mapping)) {
> mm/truncate.c:          unmap_mapping_pages(mapping, start, end - start + 1, false);
> mm/truncate.c-  }
> mm/truncate.c-out:
> 
> 
>
David Hildenbrand Jan. 11, 2022, 9 a.m. UTC | #8
On 11.01.22 08:40, Alistair Popple wrote:
> On Monday, 10 January 2022 7:37:15 PM AEDT David Hildenbrand wrote:
>> On 15.11.21 14:49, Peter Xu wrote:
>>> This check existed since the 1st git commit of Linux repository, but at that
>>> time there's no page migration yet so I think it's okay.
>>>
>>> With page migration enabled, it should logically be possible that we zap some
>>> shmem pages during migration.  When that happens, IIUC the old code could have
>>> the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
>>> without decreasing the counters for the migrating entries.  I have no unit test
>>> to prove it as I don't know an easy way to trigger this condition, though.
>>>
>>> Besides, the optimization itself is already confusing IMHO to me in a few points:
>>>
>>>   - The wording "skip swap entries" is confusing, because we're not skipping all
>>>     swap entries - we handle device private/exclusive pages before that.
>>
>> I think one part of the confusion is "swap vs non-swap" entries.
>> For !pte_none() && !pte_present() we can have
>>
>> * swap entry
>> * non-swap entry
>> ** device exclusive entry
>> ** device private entry
>> ** HWpoison entry
>> ** migration entry
>>
>> So the comment claims to skip "swap entries" but also skips HWpoison and
>> migration entries, and I think that's the confusing part.
>> Both only apply to PageAnon().
> 
> I must be missing something but why do these only apply to PageAnon()?

My memory might be wrong. I remember that for PageAnon() we need
migration/hwpoison entries because there is no way we could refault the
page from a mapping once we zap the entry. For everything else, we could
zap and refault. But it looks like we do indeed also use migration/hwpoison
entries for pages with a mapping, although it might not be strictly
required.

> 
>> IIUC, the only way we could get details != NULL is via unmap_mapping_page()+unmap_mapping_pages().
>>
>> I do wonder if any of the callers really cares about PageAnon() pages where this would be relevant.
>>
>> Am I wrong or is unmap_mapping_pages() never called with "even_cows == true" and we can remove
>> that paremeter:
> 
> Except that unmap_mapping_range() takes `even_cows` as a parameter and passes
> that to unmap_mapping_pages(), and from what I can tell there are callers of
> unmap_mapping_range() that set `even_cows = true`.

You're right.
Peter Xu Jan. 12, 2022, 1:18 p.m. UTC | #9
On Sat, Jan 08, 2022 at 05:19:04PM -0800, Hugh Dickins wrote:
> On Mon, 15 Nov 2021, Peter Xu wrote:
> 
> > This check existed since the 1st git commit of Linux repository, but at that
> > time there's no page migration yet so I think it's okay.
> 
> //git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> helps with the history.  Yes, the check was okay back then,
> but a lot of changes have come in since: I'll tell more of those below.

Thanks for looking at this.  By the way, the link is very helpful.  It's
always good to be able to read into the history.

> 
> You are absolutely right to clean this up and fix the bugs that
> have crept in, but I think the patch itself is not quite right yet.

Do you mean the mapping check on the pmd path?  If so I agree, and I'll add that
in v2 (even if, as you pointed out and analyzed below, that shouldn't be a real
bug, IIUC).

Let me know if I missed anything else.

> 
> > 
> > With page migration enabled, it should logically be possible that we zap some
> > shmem pages during migration.
> 
> Yes.
> 
> > When that happens, IIUC the old code could have
> > the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> > without decreasing the counters for the migrating entries.  I have no unit test
> > to prove it as I don't know an easy way to trigger this condition, though.
> 
> In the no-details case, yes, it does look like that. I ought to try
> and reproduce that, but responding to mails seems more important.

Please let me know if you know how to reproduce it, since I don't yet know of a
real reproducer.

What I can do, though, is apply the below patch to the current Linux master:

=============
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..51fe02a22ea9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1383,8 +1383,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
                }
 
                /* If details->check_mapping, we leave swap entries. */
-               if (unlikely(details))
+               if (unlikely(details)) {
+                       WARN_ON_ONCE(is_migration_entry(entry));
                        continue;
+               }
 
                if (!non_swap_entry(entry))
                        rss[MM_SWAPENTS]--;
=============

Then I can easily trigger it with the help of the attached test program
(zap-migrate.c).

IMHO it won't really reproduce, because it seems all relevant shmem operations
(e.g. hole punch, unmap..) take the page lock, so it won't really race with
migration (which needs the page lock too), as mentioned in previous cover
letters when I was still working on the old versions of this.  But it's possible
that I missed something.

The WARN_ON_ONCE() can trigger only because of the fast path in hole
punching, where we do the pre-unmap without the lock:

		if ((u64)unmap_end > (u64)unmap_start)
			unmap_mapping_range(mapping, unmap_start,
					    1 + unmap_end - unmap_start, 0);
		shmem_truncate_range(inode, offset, offset + len - 1);

But that will be fixed up right afterwards in shmem_truncate_range(), AFAICT,
which does hold the lock.  Hence we have a small window to at least trigger the
warning above, even if it won't actually mess up the accounting in the end.

> 
> > 
> > Besides, the optimization itself is already confusing IMHO to me in a few points:
> 
> It is confusing and unnecessary and wrong, I agree.
> 
> > 
> >   - The wording "skip swap entries" is confusing, because we're not skipping all
> >     swap entries - we handle device private/exclusive pages before that.
> 
> I'm entirely ignorant of device pages, so cannot comment on your 2/2,
> but of course it's good if you can bring the cases closer together.
> 
> > 
> >   - The skip behavior is enabled as long as zap_details pointer passed over.
> >     It's very hard to figure that out for a new zap caller because it's unclear
> >     why we should skip swap entries when we have zap_details specified.
> 
> History below will clarify that.
> 
> > 
> >   - With modern systems, especially performance critical use cases, swap
> >     entries should be rare, so I doubt the usefulness of this optimization
> >     since it should be on a slow path anyway.
> > 
> >   - It is not aligned with what we do with huge pmd swap entries, where in
> >     zap_huge_pmd() we'll do the accounting unconditionally.
> 
> The patch below does not align with what's done in zap_huge_pmd() either;
> but I think zap_huge_pmd() is okay without "details" because its only swap
> entries are migration entries, and we do not use huge pages when COWing
> from file huge pages.
> 
> > 
> > This patch drops that trick, so we handle swap ptes coherently.  Meanwhile we
> > should do the same mapping check upon migration entries too.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/memory.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8f1de811a1dc..e454f3c6aeb9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  			continue;
> >  		}
> >  
> > -		/* If details->check_mapping, we leave swap entries. */
> > -		if (unlikely(details))
> > -			continue;
> > -
> 
> Good.
> 
> >  		if (!non_swap_entry(entry))
> >  			rss[MM_SWAPENTS]--;
> >  		else if (is_migration_entry(entry)) {
> >  			struct page *page;
> >  
> >  			page = pfn_swap_entry_to_page(entry);
> > +			if (unlikely(zap_skip_check_mapping(details, page)))
> > +				continue;
> 
> Good; though I don't think it's worth an "unlikely", and I say again,

Sure, I'll drop this "unlikely".  Meanwhile, shall we drop all the other
"unlikely"s around this check too?

> more forcefully, that "zap_skip_check_mapping" is a terrible name.
> 
> David suggested naming its inversion should_zap_page(), yes, that's
> much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> but should_zap_page() for the inversion sounds good to me.

Ah sure, I'll rename it to should_zap_page() in a new patch before this.
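
Just to illustrate, the inverted helper could look something like the sketch
below (written against the current zap_details, assuming its mapping field is
named zap_mapping as I recall; not the final version of the patch):

	/* Return true if this page should be zapped by the current caller. */
	static inline bool should_zap_page(struct zap_details *details,
					   struct page *page)
	{
		/* No filter specified, or no backing page: always zap */
		if (!details || !page)
			return true;

		/* Zap unless the caller restricted zapping to another mapping */
		return !details->zap_mapping ||
		       details->zap_mapping == page_rmapping(page);
	}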

> 
> And I'm pleased to see in one of Matthew's diffs that he intends to
> bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> from include/linux/mm.h.

Yeah, it's only used in memory.c.  I put it there initially because I wanted
zap_details users to be able to see what that mapping check is all about, but
maybe that's not necessary.

If you or Matthew could provide a link to that patch, I'll be happy to pick
it up too with this series.  Or I can do nothing, assuming Matthew will
handle it elsewhere.

> 
> >  			rss[mm_counter(page)]--;
> >  		}
> >  		if (unlikely(!free_swap_and_cache(entry)))
> > -- 
> > 2.32.0
> 
> History.  zap_details came in 2.6.6, and it was mostly to manage
> truncation on the non-linear vmas we had at that time (remap_file_pages
> syscall packing non-sequential offsets into a single vma, with pte_file
> entries), where index could not be deduced from position of pte in vma:
> truncation range had to be passed down in zap_details; and an madvise
> needed it too, so it could not be private to mm/memory.c then.
> 
> But at the same time, I added the even_cows argument to
> unmap_mapping_range(), to distinguish truncation (which POSIX requires
> to unmap even COWed pages) from invalidation (for page cache coherence,
> which shouldn't touch private COWs).  However, there appear to be no
> users of that in 2.6.6, though I wouldn't have added that complication
> just for the fun of confusing everyone: best guess would be that there
> was parallel development, and the use for !even_cows got removed in
> the very same release that it was being added.
> 
> (PageAnon was brand new in 2.6.6: maybe it could have been used instead
> of comparing details->check_mapping, or maybe there's some other case
> I've forgotten which actually needs the exact mapping check.)
> 
> Eventually a few drivers came to use unmap_shared_mapping_range(),
> the !even_cows caller; but more to the point, hole-punching came in,
> and I felt very strongly that hole-punching on a file had no right
> to delete private user data.  So then !even_cows became useful.
> 
> IIRC, I've seen Linus say he *detests* zap_details.  It had much better
> justification in the days of non-linear, and I was sad to add
> single_page to it quite recently; but hope that can go away later
> (when try_to_unmap_one() is properly extended to THPs).
> 
> Now, here's what may clear up a lot of the confusion.
> The 2.6.7 zap_pte_range() got a line at the head of zap_pte_range()
> 	if (details && !details->check_mapping && !details->nonlinear_vma)
> 		details = NULL;
> which paired with the
> 		/*
> 		 * If details->check_mapping, we leave swap entries;
> 		 * if details->nonlinear_vma, we leave file entries.
> 		 */
> 		if (unlikely(details))
> 			continue;
> lower down.  I haven't followed up the actual commits, but ChangeLog-2.6.7
> implies that 2.6.6 had a "details = NULL;" placed elsewhere but buggily.
> In 2.6.12 it moved from zap_pte_range() to unmap_page_range().
> It was intended, not so much to optimize, as to simplify the flow;
> but in fact has caused all this confusion.
> 
> When Kirill discontinued non-linear mapping support in 4.0 (no tears
> were shed and) the nonlinear_vma comment above got deleted, leaving
> just the then more puzzling check_mapping comment.
> 
> Then in 4.6 the "details = NULL;" seems to have got deleted as part of
> aac453635549 ("mm, oom: introduce oom reaper"), which added some new
> details (ignore_dirty and check_swap_entries); which got reverted in
> 4.11, but apparently the "details = NULL;" not restored.
> 
> My stamina is suddenly exhausted, without actually pinpointing a commit
> for "Fixes:" in your eventual cleanup.  Sorry, I've had enough!

Yeah, digging through all these trivial details is in most cases a pain; thanks
for already digging most of it out of the mist.

That's really where I hope this patch can help: not only because the uffd work
will rely on it, but also because it resolves this early (we do sometimes use
wording like "technical debt"; I think it's the same thing here in a different
form), while the above "history.git" is still functional so we can still
reference it.

With your help and the history.git I can try to write a better commit message,
because obviously some of the content needs amending (it's not a pure
optimization at all), but I assume the patch content will be mostly the same,
with the tweaks applied.

As stated, so far I don't know of any real reproducer, so maybe it's not a real
issue in any production system?  Maybe that'll make it a bit easier, because
then we don't strictly require a Fixes tag (which could be another hard question
to answer besides the issue itself).

Thanks,
Peter Xu Jan. 12, 2022, 1:26 p.m. UTC | #10
On Wed, Jan 12, 2022 at 09:18:09PM +0800, Peter Xu wrote:
> On Sat, Jan 08, 2022 at 05:19:04PM -0800, Hugh Dickins wrote:
> > On Mon, 15 Nov 2021, Peter Xu wrote:
> > 
> > > This check existed since the 1st git commit of Linux repository, but at that
> > > time there's no page migration yet so I think it's okay.
> > 
> > //git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> > helps with the history.  Yes, the check was okay back then,
> > but a lot of changes have come in since: I'll tell more of those below.
> 
> Thanks for looking at this.  By the way, the link is greatly helpful. It's
> always good to be able to read into the history.
> 
> > 
> > You are absolutely right to clean this up and fix the bugs that
> > have crept in, but I think the patch itself is not quite right yet.
> 
> Do you mean the pmd path on checking mapping?  If so I agree, and I'll add that
> in v2 (even if as you pointed out that shouldn't be a real bug, iiuc, as you
> analyzed below).
> 
> Let me know if I missed anything else..
> 
> > 
> > > 
> > > With page migration enabled, it should logically be possible that we zap some
> > > shmem pages during migration.
> > 
> > Yes.
> > 
> > > When that happens, IIUC the old code could have
> > > the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> > > without decreasing the counters for the migrating entries.  I have no unit test
> > > to prove it as I don't know an easy way to trigger this condition, though.
> > 
> > In the no-details case, yes, it does look like that. I ought to try
> > and reproduce that, but responding to mails seems more important.
> 
> Please let me know if you know how to reproduce it, since I don't know yet a
> real reproducer.
> 
> What I can do, though, is if with below patch applied to current linux master:
> 
> =============
> diff --git a/mm/memory.c b/mm/memory.c
> index 8f1de811a1dc..51fe02a22ea9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1383,8 +1383,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                 }
>  
>                 /* If details->check_mapping, we leave swap entries. */
> -               if (unlikely(details))
> +               if (unlikely(details)) {
> +                       WARN_ON_ONCE(is_migration_entry(entry));
>                         continue;
> +               }
>  
>                 if (!non_swap_entry(entry))
>                         rss[MM_SWAPENTS]--;
> =============
> 
> Then I can easily trigger it if with the help of a test program attached
> (zap-migrate.c).

(Attaching it for real this time; also copying Matthew)

> 
> IMHO it won't really reproduce because it seems all relevant shmem operations
> (e.g. hole punch, unmap..) will take the page lock so it won't really race with
> migration (which needs the page lock too), as mentioned in previous cover
> letters when I was still working on the old versions of this.  But it could be
> possible that I missed something.
> 
> While the WARN_ON_ONCE() can trigger only because of the fast path in hole
> punching we have the pre-unmap without lock:
> 
> 		if ((u64)unmap_end > (u64)unmap_start)
> 			unmap_mapping_range(mapping, unmap_start,
> 					    1 + unmap_end - unmap_start, 0);
> 		shmem_truncate_range(inode, offset, offset + len - 1);
> 
> But that will be fixed up right afterwards in shmem_truncate_range(), afaict,
> which is with-lock.  Hence we have a small window to at least trigger the
> warning above, even if it won't really mess up the accounting finally.
> 
> > 
> > > 
> > > Besides, the optimization itself is already confusing IMHO to me in a few points:
> > 
> > It is confusing and unnecessary and wrong, I agree.
> > 
> > > 
> > >   - The wording "skip swap entries" is confusing, because we're not skipping all
> > >     swap entries - we handle device private/exclusive pages before that.
> > 
> > I'm entirely ignorant of device pages, so cannot comment on your 2/2,
> > but of course it's good if you can bring the cases closer together.
> > 
> > > 
> > >   - The skip behavior is enabled as long as zap_details pointer passed over.
> > >     It's very hard to figure that out for a new zap caller because it's unclear
> > >     why we should skip swap entries when we have zap_details specified.
> > 
> > History below will clarify that.
> > 
> > > 
> > >   - With modern systems, especially performance critical use cases, swap
> > >     entries should be rare, so I doubt the usefulness of this optimization
> > >     since it should be on a slow path anyway.
> > > 
> > >   - It is not aligned with what we do with huge pmd swap entries, where in
> > >     zap_huge_pmd() we'll do the accounting unconditionally.
> > 
> > The patch below does not align with what's done in zap_huge_pmd() either;
> > but I think zap_huge_pmd() is okay without "details" because its only swap
> > entries are migration entries, and we do not use huge pages when COWing
> > from file huge pages.
> > 
> > > 
> > > This patch drops that trick, so we handle swap ptes coherently.  Meanwhile we
> > > should do the same mapping check upon migration entries too.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  mm/memory.c | 6 ++----
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f1de811a1dc..e454f3c6aeb9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > >  			continue;
> > >  		}
> > >  
> > > -		/* If details->check_mapping, we leave swap entries. */
> > > -		if (unlikely(details))
> > > -			continue;
> > > -
> > 
> > Good.
> > 
> > >  		if (!non_swap_entry(entry))
> > >  			rss[MM_SWAPENTS]--;
> > >  		else if (is_migration_entry(entry)) {
> > >  			struct page *page;
> > >  
> > >  			page = pfn_swap_entry_to_page(entry);
> > > +			if (unlikely(zap_skip_check_mapping(details, page)))
> > > +				continue;
> > 
> > Good; though I don't think it's worth an "unlikely", and I say again,
> 
> Sure, I'll drop this "unlikely".  Meanwhile, shall we drop all the rest of the
> "unlikely" too around this check?
> 
> > more forcefully, that "zap_skip_check_mapping" is a terrible name.
> > 
> > David suggested naming its inversion should_zap_page(), yes, that's
> > much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> > but should_zap_page() for the inversion sounds good to me.
> 
> Ah sure, I'll rename it to should_zap_page() in a new patch before this.
> 
> > 
> > And I'm pleased to see in one of Matthew's diffs that he intends to
> > bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> > from include/linux/mm.h.
> 
> Yeah it's only used in memory.c.  I put it there initially because I wanted
> zap_details user can reference what's that check mapping is all about, but
> maybe that's not necessary.
> 
> If you or Matthew could provide a link to that patch, I'll be happy to pick
> that up too with this series.  Or I can also do nothing assuming Matthew will
> handle it elsewhere.
> 
> > 
> > >  			rss[mm_counter(page)]--;
> > >  		}
> > >  		if (unlikely(!free_swap_and_cache(entry)))
> > > -- 
> > > 2.32.0
> > 
> > History.  zap_details came in 2.6.6, and it was mostly to manage
> > truncation on the non-linear vmas we had at that time (remap_file_pages
> > syscall packing non-sequential offsets into a single vma, with pte_file
> > entries), where index could not be deduced from position of pte in vma:
> > truncation range had to be passed down in zap_details; and an madvise
> > needed it too, so it could not be private to mm/memory.c then.
> > 
> > But at the same time, I added the even_cows argument to
> > unmap_mapping_range(), to distinguish truncation (which POSIX requires
> > to unmap even COWed pages) from invalidation (for page cache coherence,
> > which shouldn't touch private COWs).  However, there appear to be no
> > users of that in 2.6.6, though I wouldn't have added that complication
> > just for the fun of confusing everyone: best guess would be that there
> > was parallel development, and the use for !even_cows got removed in
> > the very same release that it was being added.
> > 
> > (PageAnon was brand new in 2.6.6: maybe it could have been used instead
> > of comparing details->check_mapping, or maybe there's some other case
> > I've forgotten which actually needs the exact mapping check.)
> > 
> > Eventually a few drivers came to use unmap_shared_mapping_range(),
> > the !even_cows caller; but more to the point, hole-punching came in,
> > and I felt very strongly that hole-punching on a file had no right
> > to delete private user data.  So then !even_cows became useful.
> > 
> > IIRC, I've seen Linus say he *detests* zap_details.  It had much better
> > justification in the days of non-linear, and I was sad to add
> > single_page to it quite recently; but hope that can go away later
> > (when try_to_unmap_one() is properly extended to THPs).
> > 
> > Now, here's what may clear up a lot of the confusion.
> > The 2.6.7 zap_pte_range() got a line at the head of zap_pte_range()
> > 	if (details && !details->check_mapping && !details->nonlinear_vma)
> > 		details = NULL;
> > which paired with the
> > 		/*
> > 		 * If details->check_mapping, we leave swap entries;
> > 		 * if details->nonlinear_vma, we leave file entries.
> > 		 */
> > 		if (unlikely(details))
> > 			continue;
> > lower down.  I haven't followed up the actual commits, but ChangeLog-2.6.7
> > implies that 2.6.6 had a "details = NULL;" placed elsewhere but buggily.
> > In 2.6.12 it moved from zap_pte_range() to unmap_page_range().
> > It was intended, not so much to optimize, as to simplify the flow;
> > but in fact has caused all this confusion.
> > 
> > When Kirill discontinued non-linear mapping support in 4.0 (no tears
> > were shed and) the nonlinear_vma comment above got deleted, leaving
> > just the then more puzzling check_mapping comment.
> > 
> > Then in 4.6 the "details = NULL;" seems to have got deleted as part of
> > aac453635549 ("mm, oom: introduce oom reaper"), which added some new
> > details (ignore_dirty and check_swap_entries); which got reverted in
> > 4.11, but apparently the "details = NULL;" not restored.
> > 
> > My stamina is suddenly exhausted, without actually pinpointing a commit
> > for "Fixes:" in your eventual cleanup.  Sorry, I've had enough!
> 
> Yeah, it's in most cases a pain to dig through all these trivial details;
> thanks for already digging most of it out of the mist.
> 
> That's really where I hope this patch can help: not only because the uffd work
> will rely on it, but also because resolving this early (we sometimes use
> wordings like "technical debt"; I think this is the same thing in a different
> form) means the above "history.git" is still around so we can still reference it.
> 
> With your help and the history.git I can try a better commit message, because
> obviously some of the content needs amending (it's not a pure optimization at
> all), but I assume the patch content will be mostly the same, with the tweaks
> applied.
> 
> From what has been stated so far I don't know of any real reproducer, so maybe
> it's not a real issue in any production system?  Maybe that'll make it a bit
> easier, because then we don't strictly require a Fixes tag (which could be
> another hard question to answer besides the issue itself).
> 
> Thanks,
> 
> -- 
> Peter Xu
Hugh Dickins Jan. 13, 2022, 3:47 a.m. UTC | #11
On Wed, 12 Jan 2022, Peter Xu wrote:
> On Wed, Jan 12, 2022 at 09:18:09PM +0800, Peter Xu wrote:
> > On Sat, Jan 08, 2022 at 05:19:04PM -0800, Hugh Dickins wrote:
> > > On Mon, 15 Nov 2021, Peter Xu wrote:
> > > 
> > > You are absolutely right to clean this up and fix the bugs that
> > > have crept in, but I think the patch itself is not quite right yet.
> > 
> > Do you mean the pmd path on checking mapping?

Sorry, no, I don't mean that at all. We agree that we cannot see an actual
bug there, so nothing to fix; and re-exporting all the zap_details stuff
to make it available to mm/huge_memory.c would be rather sad.  You might
wish to do some future-proofing, but let's just leave that to the future.

I mean that the handling of swap-like entries in zap_pte_range(), after
you've removed the "if (unlikely(details)) continue;", is not right yet.

> > If so I agree, and I'll add that
> > in v2 (even if as you pointed out that shouldn't be a real bug, iiuc, as you
> > analyzed below).
> > 
> > Let me know if I missed anything else..

> > What I can do, though, is if with below patch applied to current linux master:
> > 
> > =============
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8f1de811a1dc..51fe02a22ea9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1383,8 +1383,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                 }
> >  
> >                 /* If details->check_mapping, we leave swap entries. */
> > -               if (unlikely(details))
> > +               if (unlikely(details)) {
> > +                       WARN_ON_ONCE(is_migration_entry(entry));
> >                         continue;
> > +               }
> >  
> >                 if (!non_swap_entry(entry))
> >                         rss[MM_SWAPENTS]--;
> > =============
> > 
> > Then I can easily trigger it if with the help of a test program attached
> > (zap-migrate.c).
> 
> (Attaching for real; also copy Matthew)

Yes, I'm not at all surprised that you can see migration entries there.

> 
> > 
> > IMHO it won't really reproduce because it seems all relevant shmem operations
> > (e.g. hole punch, unmap..) will take the page lock so it won't really race with
> > migration (which needs the page lock too), as mentioned in previous cover
> > letters when I was still working on the old versions of this.  But it could be
> > possible that I missed something.

The latter.

There may still be some cases in which unmap_mapping_range() is called
on a single page with that page held locked, I have not checked; but in
general it is called on a range of pages, and would not have them locked.

The usual pattern when truncating/hole-punching/invalidating is to call
unmap_mapping_range() first, to do the bulk of the work efficiently,
scanning down the page tables without holding any page lock; but then
unmap racily faulted stragglers one by one while holding page lock
using unmap_mapping_page() just before deleting each from page cache.

So some relevant operations are done without page lock and some relevant
operations are done with page lock: both need to be considered, but you
are right that the page-locked ones will not encounter a migration entry
of that page (I haven't tried to see if that fact is useful here or not).

You may need to substitute "pages" or "page" or "folio" in the above.

> > 
> > While the WARN_ON_ONCE() can trigger only because of the fast path in hole
> > punching we have the pre-unmap without lock:
> > 
> > 		if ((u64)unmap_end > (u64)unmap_start)
> > 			unmap_mapping_range(mapping, unmap_start,
> > 					    1 + unmap_end - unmap_start, 0);
> > 		shmem_truncate_range(inode, offset, offset + len - 1);

Oh, I should have read on, I've been wasting my time explaining above:
there you said the page must be locked, here you find that is not so.

> > 
> > But that will be fixed up right afterwards in shmem_truncate_range(), afaict,
> > which is with-lock.  Hence we have a small window to at least trigger the
> > warning above, even if it won't really mess up the accounting finally.

Well, there is still an opportunity to do the wrong thing.

> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > >  			continue;
> > > >  		}
> > > >  
> > > > -		/* If details->check_mapping, we leave swap entries. */
> > > > -		if (unlikely(details))
> > > > -			continue;
> > > > -
> > > 
> > > Good.
> > > 
> > > >  		if (!non_swap_entry(entry))
> > > >  			rss[MM_SWAPENTS]--;

Right, I did not say at all what is wrong here, I suppose I intended
to come back but then got tired - sorry.  Here we are dealing with a
genuine swap entry.  No problem for exit or munmap or MADV_DONTNEED.
No problem for truncation (but apparently there has been an unnoticed
problem, that swapped-out COWed pages beyond the truncation point were
not being deleted, while that "details continue" was still there).  But
a problem for hole-punching: without the "details continue", this will
go on to delete COWed pages which ought to be preserved - needs a
should_zap_page() check, as you do with the migration entry.

Except that here we have no page to check, so it looks like you'll
have to change should_zap_page() to deal with this case too, or just
check details->check_mapping directly.  Which raises the question again
of why I did not just use a boolean flag there originally: aah, I think
I've found why.  In those days there was a horrible "optimization", for
better performance on some benchmark I guess, which when you read from
/dev/zero into a private mapping, would map the zero page there (look
up read_zero_pagealigned() and zeromap_page_range() if you dare).  So
there was another category of page to be skipped along with the anon
COWs, and I didn't want multiple tests in the zap loop, so checking
check_mapping against page->mapping did both.  I think nowadays you
could do it by checking for PageAnon page (or genuine swap entry)
instead.
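
To illustrate the shape of it - a sketch only, not something I insist the
patch look like (I keep saying check_mapping, but the field is currently
named zap_mapping in your tree):

---8<---
		if (!non_swap_entry(entry)) {
			/*
			 * Sketch: a genuine swap entry always belongs to an
			 * anonymous page, possibly a COW of the file being
			 * punched, and there is no struct page to compare
			 * mappings against; so test the details directly.
			 * Hole-punching passes a mapping and must leave
			 * private COWs (and their swap entries) alone.
			 */
			if (details && details->zap_mapping)
				continue;
			rss[MM_SWAPENTS]--;
		}
---8<---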

> > > >  		else if (is_migration_entry(entry)) {
> > > >  			struct page *page;
> > > >  
> > > >  			page = pfn_swap_entry_to_page(entry);
> > > > +			if (unlikely(zap_skip_check_mapping(details, page)))
> > > > +				continue;
> > > 
> > > Good; though I don't think it's worth an "unlikely", and I say again,
> > 
> > Sure, I'll drop this "unlikely".  Meanwhile, shall we drop all the rest of the
> > "unlikely" too around this check?
> > 
> > > more forcefully, that "zap_skip_check_mapping" is a terrible name.
> > > 
> > > David suggested naming its inversion should_zap_page(), yes, that's
> > > much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> > > but should_zap_page() for the inversion sounds good to me.
> > 
> > Ah sure, I'll rename it to should_zap_page() in a new patch before this.

Thanks; though now even that name is looking dubious - should_zap_entry()?

> > 
> > > 
> > > And I'm pleased to see in one of Matthew's diffs that he intends to
> > > bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> > > from include/linux/mm.h.
> > 
> > Yeah it's only used in memory.c.  I put it there initially because I wanted
> > zap_details users to be able to see what that check-mapping is all about, but
> > maybe that's not necessary.
> > 
> > If you or Matthew could provide a link to that patch, I'll be happy to pick
> > that up too with this series.  Or I can also do nothing assuming Matthew will
> > handle it elsewhere.

In Linus's tree today: 3506659e18a6 ("mm: Add unmap_mapping_folio()").

> > 
> > > 
> > > >  			rss[mm_counter(page)]--;
> > > >  		}

I have given no thought as to whether more "else"s are needed there.

> > > >  		if (unlikely(!free_swap_and_cache(entry)))

(I think you moved that to a more relevant place in a later patch: good.)

I know you're also asking for test suggestions, but I'll leave it to you:
hole-punching, truncation, swapping, compaction (to force migration).

Hugh
Peter Xu Jan. 20, 2022, 10:32 a.m. UTC | #12
Hi, Hugh,

On Wed, Jan 12, 2022 at 07:47:40PM -0800, Hugh Dickins wrote:
> > > IMHO it won't really reproduce because it seems all relevant shmem operations
> > > (e.g. hole punch, unmap..) will take the page lock so it won't really race with
> > > migration (which needs the page lock too), as mentioned in previous cover
> > > letters when I was still working on the old versions of this.  But it could be
> > > possible that I missed something.
> 
> The latter.
> 
> There may still be some cases in which unmap_mapping_range() is called
> on a single page with that page held locked, I have not checked; but in
> general it is called on a range of pages, and would not have them locked.
> 
> The usual pattern when truncating/hole-punching/invalidating is to call
> unmap_mapping_range() first, to do the bulk of the work efficiently,
> scanning down the page tables without holding any page lock; but then
> unmap racily faulted stragglers one by one while holding page lock
> using unmap_mapping_page() just before deleting each from page cache.
> 
> So some relevant operations are done without page lock and some relevant
> operations are done with page lock: both need to be considered, but you
> are right that the page-locked ones will not encounter a migration entry
> of that page (I haven't tried to see if that fact is useful here or not).
> 
> You may need to substitute "pages" or "page" or "folio" in the above.
> 
> > > 
> > > While the WARN_ON_ONCE() can trigger only because of the fast path in hole
> > > punching we have the pre-unmap without lock:
> > > 
> > > 		if ((u64)unmap_end > (u64)unmap_start)
> > > 			unmap_mapping_range(mapping, unmap_start,
> > > 					    1 + unmap_end - unmap_start, 0);
> > > 		shmem_truncate_range(inode, offset, offset + len - 1);
> 
> Oh, I should have read on, I've been wasting my time explaining above:
> there you said the page must be locked, here you find that is not so.
> 
> > > 
> > > But that will be fixed up right afterwards in shmem_truncate_range(), afaict,
> > > which is with-lock.  Hence we have a small window to at least trigger the
> > > warning above, even if it won't really mess up the accounting finally.
> 
> Well, there is still an opportunity to do the wrong thing.

So what I wanted to express is that we can't trigger any real issue, so there's
no way to write a valid reproducer even though we know the code does the wrong
thing..  And that's why I provided a "fake" one..

Let me explain a little bit further.  Sorry, I know you know most of what's
below, so it could be that you spend 5 minutes reading nonsense..  but I want
to be 100% sure we're on the same page.

Firstly, the current code will have a problem if we:

  (1) Pass a valid "zap_details" pointer to zap_pte_range(), and,
  (2) At the same time there is a swap entry of either of the kinds below:
        (2.1) real swap entry
        (2.2) migration swap entry

Then we're doomed, because we could have erroneously skipped some swap entry
that was also under the mapping when we shouldn't have.

I didn't mention device exclusive/private entries are fine because they're
checked before the "details continue" check.

I didn't mention hwpoison entry too; but I'll discuss it later..

Then, when do we have cases above (1) or (2)?

(1) is impossible because when there's zap_details in current code it means the
page is backing a fs mapping, hence not PageAnon.  We can only hit (2).

What can hit (2)?  I only know shmem...

Next, as a conclusion, what we want to reproduce is: some caller calls
zap_pte_range() with zap_details, but accidentally there's a migration entry.

It's fairly easy to hit this, as my "fake" reproducer was doing.

However, as I stated before, all these use cases always have another step to
take the lock and redo the range.  Then even if some migration entry got
wrongly skipped it'll always be fixed.  What we need to find is some caller
that calls zap_pte_range() without later taking the page lock and redo that.
That's the only possibility to trigger a real issue on the shmem accounting.

To prove this, let's see all the possible call stacks...

zap_pte_range() will only be called with a zap_details pointer from the two
functions below:

        unmap_mapping_folio
        unmap_mapping_pages

unmap_mapping_folio() already holds the page lock, so it's not going to work.
Then the callers of unmap_mapping_pages() are:

    unmap_mapping_pages
        [1] collapse_file
        [?] unmap_mapping_range
                [?] unmap_shared_mapping_range
                        [?] vmw_bo_dirty_unmap (gpu/drm)
                        [?] inode_go_sync (gfs2)
                [1] shmem_setattr
                [1] shmem_fallocate
                [1] truncate_pagecache
                [1] truncate_pagecache_range
        [1] invalidate_inode_pages2_range

It's a reversed tree of call stacks; where I marked [1], it means that call site
does the "unmap, then take the page lock and redo" thing, so it can't reproduce
the issue.

The only outliers are:

        [?] vmw_bo_dirty_unmap (gpu/drm)
        [?] inode_go_sync (gfs2)

Which I marked with [?].

However I don't think they'll have migration entries at all, or am I wrong?

I hope I explained why I think we can't write a reproducer.. but I still think
we should fix the problem, because we can't tell whether it'll go wrong in the
future with a new caller who will not take the page lock to redo the zap.

> 
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > >  			continue;
> > > > >  		}
> > > > >  
> > > > > -		/* If details->check_mapping, we leave swap entries. */
> > > > > -		if (unlikely(details))
> > > > > -			continue;
> > > > > -
> > > > 
> > > > Good.
> > > > 
> > > > >  		if (!non_swap_entry(entry))
> > > > >  			rss[MM_SWAPENTS]--;
> 
> Right, I did not say at all what is wrong here, I suppose I intended
> to come back but then got tired - sorry.  Here we are dealing with a
> genuine swap entry.  No problem for exit or munmap or MADV_DONTNEED.
> No problem for truncation (but apparently there has been an unnoticed
> problem, that swapped-out COWed pages beyond the truncation point were
> not being deleted, while that "details continue" was still there).  But
> a problem for hole-punching: without the "details continue", this will
> go on to delete COWed pages which ought to be preserved - needs a
> should_zap_page() check, as you do with the migration entry.

I see, thanks for pointing out.

> 
> Except that here we have no page to check, so it looks like you'll
> have to change should_zap_page() to deal with this case too, or just
> check details->check_mapping directly.

Yeah I prefer this, as we don't have the page* pointer anyway.

> Which raises the question again
> of why I did not just use a boolean flag there originally: aah, I think
> I've found why.  In those days there was a horrible "optimization", for
> better performance on some benchmark I guess, which when you read from
> /dev/zero into a private mapping, would map the zero page there (look
> up read_zero_pagealigned() and zeromap_page_range() if you dare).  So
> there was another category of page to be skipped along with the anon
> COWs, and I didn't want multiple tests in the zap loop, so checking
> check_mapping against page->mapping did both.  I think nowadays you
> could do it by checking for PageAnon page (or genuine swap entry)
> instead.

It must be PageAnon already, isn't it?

> 
> > > > >  		else if (is_migration_entry(entry)) {
> > > > >  			struct page *page;
> > > > >  
> > > > >  			page = pfn_swap_entry_to_page(entry);
> > > > > +			if (unlikely(zap_skip_check_mapping(details, page)))
> > > > > +				continue;
> > > > 
> > > > Good; though I don't think it's worth an "unlikely", and I say again,
> > > 
> > > Sure, I'll drop this "unlikely".  Meanwhile, shall we drop all the rest of the
> > > "unlikely" too around this check?
> > > 
> > > > more forcefully, that "zap_skip_check_mapping" is a terrible name.
> > > > 
> > > > David suggested naming its inversion should_zap_page(), yes, that's
> > > > much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> > > > but should_zap_page() for the inversion sounds good to me.
> > > 
> > > Ah sure, I'll rename it to should_zap_page() in a new patch before this.
> 
> Thanks; though now even that name is looking dubious - should_zap_entry()?

I guess my plan is to simply keep should_zap_page() and check explicitly for
real swap entries.

> 
> > > 
> > > > 
> > > > And I'm pleased to see in one of Matthew's diffs that he intends to
> > > > bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> > > > from include/linux/mm.h.
> > > 
> > > Yeah it's only used in memory.c.  I put it there initially because I wanted
> > > zap_details users to be able to see what that check-mapping is all about, but
> > > maybe that's not necessary.
> > > 
> > > If you or Matthew could provide a link to that patch, I'll be happy to pick
> > > that up too with this series.  Or I can also do nothing assuming Matthew will
> > > handle it elsewhere.
> 
> In Linus's tree today: 3506659e18a6 ("mm: Add unmap_mapping_folio()").

Rebased.

> 
> > > 
> > > > 
> > > > >  			rss[mm_counter(page)]--;
> > > > >  		}
> 
> I have given no thought as to whether more "else"s are needed there.

It's hwpoison that's in the else.  Nothing else should.

I didn't mention it probably because I forgot.  I did think about it when
drafting, and IMHO we should simply zap that hwpoison entry because:

  (1) Zap means the user knows this data is meaningless, so at least we
      shouldn't SIGBUS for that anymore.

  (2) If we keep it there, it could erroneously trigger SIGBUS later if the
      guest accesses that pte again somehow.

I plan to mention that in the commit message, but I can also add some comments
directly into the code.  Let me know your thoughts.

> 
> > > > >  		if (unlikely(!free_swap_and_cache(entry)))
> 
> (I think you moved that to a more relevant place in a later patch: good.)
> 
> I know you're also asking for test suggestions, but I'll leave it to you:
> hole-punching, truncation, swapping, compaction (to force migration).

Please read above - again I don't think we can write a reproducer, and that's
why I provided the "fake" reproducer only.

I could always be missing something, though. I'd be glad to be corrected,
again.

Thanks,
Peter Xu Jan. 21, 2022, 3:11 a.m. UTC | #13
On Thu, Jan 20, 2022 at 06:32:29PM +0800, Peter Xu wrote:
> > Except that here we have no page to check, so it looks like you'll
> > have to change should_zap_page() to deal with this case too, or just
> > check details->check_mapping directly.
> 
> Yeah I prefer this, as we don't have the page* pointer anyway.
> 
> > Which raises the question again
> > of why I did not just use a boolean flag there originally: aah, I think
> > I've found why.  In those days there was a horrible "optimization", for
> > better performance on some benchmark I guess, which when you read from
> > /dev/zero into a private mapping, would map the zero page there (look
> > up read_zero_pagealigned() and zeromap_page_range() if you dare).  So
> > there was another category of page to be skipped along with the anon
> > COWs, and I didn't want multiple tests in the zap loop, so checking
> > check_mapping against page->mapping did both.  I think nowadays you
> > could do it by checking for PageAnon page (or genuine swap entry)
> > instead.
> 
> It must be PageAnon already, isn't it?

I think I see what you meant now..

I assume the special case is gone; how about I switch zap_mapping into a
boolean altogether in this patchset?  Thanks,
Peter Xu Jan. 21, 2022, 5:11 a.m. UTC | #14
On Fri, Jan 21, 2022 at 11:11:42AM +0800, Peter Xu wrote:
> On Thu, Jan 20, 2022 at 06:32:29PM +0800, Peter Xu wrote:
> > > Except that here we have no page to check, so it looks like you'll
> > > have to change should_zap_page() to deal with this case too, or just
> > > check details->check_mapping directly.
> > 
> > Yeah I prefer this, as we don't have the page* pointer anyway.
> > 
> > > Which raises the question again
> > > of why I did not just use a boolean flag there originally: aah, I think
> > > I've found why.  In those days there was a horrible "optimization", for
> > > better performance on some benchmark I guess, which when you read from
> > > /dev/zero into a private mapping, would map the zero page there (look
> > > up read_zero_pagealigned() and zeromap_page_range() if you dare).  So
> > > there was another category of page to be skipped along with the anon
> > > COWs, and I didn't want multiple tests in the zap loop, so checking
> > > check_mapping against page->mapping did both.  I think nowadays you
> > > could do it by checking for PageAnon page (or genuine swap entry)
> > > instead.
> > 
> > It must be PageAnon already, isn't it?
> 
> I think I see what you meant now..
> 
> I assume the special case is gone; how about I switch zap_mapping into a
> boolean altogether in this patchset?  Thanks,

Oh, one more thing..

When reading the history and also your explanations above, I noticed one thing
that may not have been right for a long time: the zero page handling when zapping.

To quote your comment above, we should keep the zero page entries too if
zap_details.zap_mapping is specified.  However, that hasn't been true for a
long time, I guess starting from when vm_normal_page() began returning NULL
for zero pfns.  I also have a very strong feeling that in the old world the
"page*" was non-NULL for zero pages here.

So... I'm boldly wondering whether we also need another fix for the zero page
handling here, probably before this whole patchset (so it'll be the 1st patch,
and it should apply directly to master), because if I'm right about the above
it can be seen as another separate bug fix:

---8<---
diff --git a/mm/memory.c b/mm/memory.c
index f306e698a1e3..9b8348746e0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1320,11 +1320,18 @@ struct zap_details {
 static inline bool
 zap_skip_check_mapping(struct zap_details *details, struct page *page)
 {
-       if (!details || !page)
+       /* No detail pointer or no zap_mapping pointer, zap all */
+       if (!details || !details->zap_mapping)
                return false;
 
-       return details->zap_mapping &&
-               (details->zap_mapping != page_rmapping(page));
+       /*
+        * For example, the zero page.  If the user wants to keep the private
+        * pages, zero page should also be in count.
+        */
+       if (!page)
+               return true;
+
+       return details->zap_mapping != page_rmapping(page);
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
---8<---

page can be NULL for e.g. PFNMAP, and also when an error occurred above.  I
assume we don't need to bother with those (e.g., I don't think PFNMAP or MIXED
should specify even_cows=false at all, because there's no page cache layer),
though.  Mostly it's about how we should handle the zero page right.

Thanks,
Hugh Dickins Jan. 24, 2022, 6:29 a.m. UTC | #15
On Thu, 20 Jan 2022, Peter Xu wrote:

> Hi, Hugh,
> 
> On Wed, Jan 12, 2022 at 07:47:40PM -0800, Hugh Dickins wrote:
> > > > IMHO it won't really reproduce because it seems all relevant shmem operations
> > > > (e.g. hole punch, unmap..) will take the page lock so it won't really race with
> > > > migration (which needs the page lock too), as mentioned in previous cover
> > > > letters when I was still working on the old versions of this.  But it could be
> > > > possible that I missed something.
> > 
> > The latter.
> > 
> > There may still be some cases in which unmap_mapping_range() is called
> > on a single page with that page held locked, I have not checked; but in
> > general it is called on a range of pages, and would not have them locked.
> > 
> > The usual pattern when truncating/hole-punching/invalidating is to call
> > unmap_mapping_range() first, to do the bulk of the work efficiently,
> > scanning down the page tables without holding any page lock; but then
> > unmap racily faulted stragglers one by one while holding page lock
> > using unmap_mapping_page() just before deleting each from page cache.
> > 
> > So some relevant operations are done without page lock and some relevant
> > operations are done with page lock: both need to be considered, but you
> > are right that the page-locked ones will not encounter a migration entry
> > of that page (I haven't tried to see if that fact is useful here or not).
> > 
> > You may need to substitute "pages" or "page" or "folio" in the above.
> > 
> > > > 
> > > > While the WARN_ON_ONCE() can trigger only because of the fast path in hole
> > > > punching we have the pre-unmap without lock:
> > > > 
> > > > 		if ((u64)unmap_end > (u64)unmap_start)
> > > > 			unmap_mapping_range(mapping, unmap_start,
> > > > 					    1 + unmap_end - unmap_start, 0);
> > > > 		shmem_truncate_range(inode, offset, offset + len - 1);
> > 
> > Oh, I should have read on, I've been wasting my time explaining above:
> > there you said the page must be locked, here you find that is not so.
> > 
> > > > 
> > > > But that will be fixed up right afterwards in shmem_truncate_range(), afaict,
> > > > which is with-lock.  Hence we have a small window to at least trigger the
> > > > warning above, even if it won't really mess up the accounting finally.
> > 
> > Well, there is still an opportunity to do the wrong thing.
> 
> So what I wanted to express is that we can't trigger any real issue, so there's
> no way to write a valid reproducer even though we know the code does the wrong
> thing..  And that's why I provided a "fake" one..
> 
> Let me explain a little bit further.  Sorry, I know you know most of what's
> below, so it could be that you spend 5 minutes reading nonsense..  but I want
> to be 100% sure we're on the same page.
> 
> Firstly, the current code will have a problem if we:
> 
>   (1) Pass a valid "zap_details" pointer to zap_pte_range(), and,
>   (2) At the same time there is a swap entry of either of the kinds below:
>         (2.1) real swap entry
>         (2.2) migration swap entry
> 
> Then we're doomed, because we could have erroneously skipped some swap entry
> that was also under the mapping when we shouldn't have.
> 
> I didn't mention device exclusive/private entries are fine because they're
> checked before the "details continue" check.
> 
> I didn't mention hwpoison entry too; but I'll discuss it later..
> 
> Then, when do we have cases above (1) or (2)?
> 
> (1) is impossible because when there's zap_details in current code it means the
> page is backing a fs mapping, hence not PageAnon.  We can only hit (2).

No: a MAP_PRIVATE fs mapping has a PageAnon page wherever CopyOnWrite
has been done.

> 
> What can hit (2)?  I only know shmem...

No: I cannot see any special case for shmem versus file versus anon in
try_to_migrate_one().

> 
> Next, as a conclusion, what we want to reproduce is: some caller calls
> zap_pte_range() with zap_details, but accidentally there's a migration entry.
> 
> It's fairly easy to hit this, as my "fake" reproducer was doing.
> 
> However, as I stated before, all these use cases always have another step to
> take the lock and redo the range.  Then even if some migration entry got
> wrongly skipped it'll always be fixed.  What we need to find is some caller
> that calls zap_pte_range() without later taking the page lock and redo that.
> That's the only possibility to trigger a real issue on the shmem accounting.

I agree that the fallback "if (folio_mapped() unmap_mapping_folio()",
while holding folio lock, ensures that there cannot be a migration entry
substituted for present pte at that time, so no problem if migration entry
was wrongly skipped on the earlier unlocked pass.

But you're forgetting the complementary mistake: that the earlier unlocked
pass might have zapped a migration entry (corresponding to an anon COWed
page) when it should have skipped it (while punching a hole).

> 
> To prove this, let's see all the possible call stacks...
> 
> zap_pte_range() will only be called with a zap_details pointer from the two
> functions below:
> 
>         unmap_mapping_folio
>         unmap_mapping_pages
> 
> unmap_mapping_folio() already holds the page lock, so it's not going to work.
> Then the callers of unmap_mapping_pages() are:
> 
>     unmap_mapping_pages
>         [1] collapse_file
>         [?] unmap_mapping_range
>                 [?] unmap_shared_mapping_range
>                         [?] vmw_bo_dirty_unmap (gpu/drm)
>                         [?] inode_go_sync (gfs2)
>                 [1] shmem_setattr
>                 [1] shmem_fallocate
>                 [1] truncate_pagecache
>                 [1] truncate_pagecache_range
>         [1] invalidate_inode_pages2_range
> 
> It's a reversed tree of call stacks; where I marked [1], it means that call site
> does the "unmap, then take the page lock and redo" thing, so it can't reproduce
> the issue.
> 
> The only outliers are:
> 
>         [?] vmw_bo_dirty_unmap (gpu/drm)
>         [?] inode_go_sync (gfs2)
> 
> Which I marked with [?].
> 
> However I don't think they'll have migration entries at all, or am I wrong?

To be honest, I have not made the effort to understand your point there,
since I already suggested above that three of your assumptions are wrong.
But I think we should presume that the upper levels have chosen the right
setting for even_cows, and zap_pte_range() just needs to respect it.

> 
> I hope I explained why I think we can't write a reproducer.. but I still think
> we should fix the problem, because we can't tell whether it'll go wrong in the
> future with a new caller who will not take the page lock to redo the zap.
> 
> > 
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > >  			continue;
> > > > > >  		}
> > > > > >  
> > > > > > -		/* If details->check_mapping, we leave swap entries. */
> > > > > > -		if (unlikely(details))
> > > > > > -			continue;
> > > > > > -
> > > > > 
> > > > > Good.
> > > > > 
> > > > > >  		if (!non_swap_entry(entry))
> > > > > >  			rss[MM_SWAPENTS]--;
> > 
> > Right, I did not say at all what is wrong here, I suppose I intended
> > to come back but then got tired - sorry.  Here we are dealing with a
> > genuine swap entry.  No problem for exit or munmap or MADV_DONTNEED.
> > No problem for truncation (but apparently there has been an unnoticed
> > problem, that swapped-out COWed pages beyond the truncation point were
> > not being deleted, while that "details continue" was still there).  But
> > a problem for hole-punching: without the "details continue", this will
> > go on to delete COWed pages which ought to be preserved - needs a
> > should_zap_page() check, as you do with the migration entry.
> 
> I see, thanks for pointing out.
> 
> > 
> > Except that here we have no page to check, so it looks like you'll
> > have to change should_zap_page() to deal with this case too, or just
> > check details->check_mapping directly.
> 
> Yeah I prefer this, as we don't have the page* pointer anyway.
> 
> > Which raises the question again
> > of why I did not just use a boolean flag there originally: aah, I think
> > I've found why.  In those days there was a horrible "optimization", for
> > better performance on some benchmark I guess, which when you read from
> > /dev/zero into a private mapping, would map the zero page there (look
> > up read_zero_pagealigned() and zeromap_page_range() if you dare).  So
> > there was another category of page to be skipped along with the anon
> > COWs, and I didn't want multiple tests in the zap loop, so checking
> > check_mapping against page->mapping did both.  I think nowadays you
> > could do it by checking for PageAnon page (or genuine swap entry)
> > instead.
> 
> It must be PageAnon already, isn't it?

I did not understand what you were asking there; but in your followup
mail, I think you came to understand what I meant better.  Yes, I
believe you could safely replace struct address_space *zap_mapping
by a more understandable boolean (skip_cows? its inverse would be
easier to understand, but we don't want almost everyone to have to
pass a zap_details initialized to true there).

> 
> > 
> > > > > >  		else if (is_migration_entry(entry)) {
> > > > > >  			struct page *page;
> > > > > >  
> > > > > >  			page = pfn_swap_entry_to_page(entry);
> > > > > > +			if (unlikely(zap_skip_check_mapping(details, page)))
> > > > > > +				continue;
> > > > > 
> > > > > Good; though I don't think it's worth an "unlikely", and I say again,
> > > > 
> > > > Sure, I'll drop this "unlikely".  Meanwhile, shall we drop all the rest of the
> > > > "unlikely" too around this check?
> > > > 
> > > > > more forcefully, that "zap_skip_check_mapping" is a terrible name.
> > > > > 
> > > > > David suggested naming its inversion should_zap_page(), yes, that's
> > > > > much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> > > > > but should_zap_page() for the inversion sounds good to me.
> > > > 
> > > > Ah sure, I'll rename it to should_zap_page() in a new patch before this.
> > 
> > Thanks; though now even that name is looking dubious - should_zap_entry()?
> 
> I guess my plan is to simply keep should_zap_page() and check explicitly for
> real swap entries.

I won't know whether that's right or not until seeing the actual patch.
The important thing to keep in mind is that the treatment of a
migration entry needs to match the nature of the page it is migrating:
if skip_cows, then the migration entry of a PageAnon should be skipped
not zapped, just as a PageAnon or a real swap entry would be.

The fact that a migration entry is encoded much like a swap entry,
and a real swap entry corresponds to a PageAnon, is misleading here.
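
As a sketch of what I mean (should_zap_page() being the renamed helper we
discussed, deciding on the page it is given - illustration only, not a
requirement on the patch):

---8<---
		else if (is_migration_entry(entry)) {
			struct page *page = pfn_swap_entry_to_page(entry);

			/*
			 * Sketch: the decision follows the page being
			 * migrated, not the entry type.  A migration entry
			 * of a PageAnon (COWed) page must be skipped in
			 * exactly the cases where a present anon pte or a
			 * real swap entry would be skipped.
			 */
			if (!should_zap_page(details, page))
				continue;
			rss[mm_counter(page)]--;
		}
---8<---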

> 
> > 
> > > > 
> > > > > 
> > > > > And I'm pleased to see in one of Matthew's diffs that he intends to
> > > > > bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> > > > > from include/linux/mm.h.
> > > > 
> > > > Yeah it's only used in memory.c.  I put it there initially because I wanted
> > > > zap_details users to be able to see what that check-mapping is all about, but
> > > > maybe that's not necessary.
> > > > 
> > > > If you or Matthew could provide a link to that patch, I'll be happy to pick
> > > > that up too with this series.  Or I can also do nothing assuming Matthew will
> > > > handle it elsewhere.
> > 
> > In Linus's tree today: 3506659e18a6 ("mm: Add unmap_mapping_folio()").
> 
> Rebased.
> 
> > 
> > > > 
> > > > > 
> > > > > >  			rss[mm_counter(page)]--;
> > > > > >  		}
> > 
> > I have given no thought as to whether more "else"s are needed there.
> 
> It's hwpoison that's in the else.  Nothing else should.
> 
> I didn't mention it probably because I forgot.  I did think about it when
> drafting, and IMHO we should simply zap that hwpoison entry because:
> 
>   (1) Zap means the user knows this data is meaningless, so at least we
>       shouldn't SIGBUS for that anymore.
> 
> >   (2) If we keep it there, it could erroneously trigger SIGBUS later if the
> >       guest accesses that pte again somehow.
> 
> I plan to mention that in the commit message, but I can also add some comments
> directly into the code.  Let me know your thoughts.

It comes down, again, to what punching a hole in a file should do
to the private data that has been COWed from it.  Strictly, it should
not interfere with it, even when the COWed page has become poisonous:
the entry should be left in to generate SIGBUS.  Whereas ordinary
unmapping or truncating or MADV_DONTNEEDing would zap it.
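
As a sketch only (hwpoison handling is not in the patch above, and the
details/zap_mapping test here is just shorthand for "we are preserving COWs" -
name it however you end up naming it):

---8<---
		else if (is_hwpoison_entry(entry)) {
			/*
			 * Sketch: a poisoned COWed page is still private
			 * user data.  When punching a hole (a mapping was
			 * passed, COWs preserved), keep the marker so a
			 * later access still gets SIGBUS; plain unmap,
			 * truncation or MADV_DONTNEED may drop it.
			 */
			if (details && details->zap_mapping)
				continue;
		}
---8<---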

> 
> > 
> > > > > >  		if (unlikely(!free_swap_and_cache(entry)))
> > 
> > (I think you moved that to a more relevant place in a later patch: good.)
> > 
> > I know you're also asking for test suggestions, but I'll leave it to you:
> > hole-punching, truncation, swapping, compaction (to force migration).
> 
> Please read above - again I don't think we can write a reproducer, and that's
> why I provided the "fake" reproducer only.

I haven't studied your "fake" reproducer at all, I guess I'm not all that
interested in fakes.  An actual reproducer would be able to show that the
current code (5.17-rc1 and recent) is zapping COWed pages in MAP_PRIVATE
areas when a hole is punched in the corresponding file.  Or, would it
show that I've been wrong all along in what I've said on this?

(Yes, our repeated back and forth on this does make me wonder whether
it would have been easier to define hole-punch as deleting COWs too:
but it's years too late to change our minds on that; and I think a
closer examination might show that it's difficult to guarantee the
deletion of COWs when punching a hole - traditionally, truncation
has leant on i_size for consistent implementation, and hole-punching
has been much trickier, because it does not change i_size at all.)

Hugh

> 
> I could always be missing something, though. I'd be glad to be corrected,
> again.
> 
> Thanks,
> 
> -- 
> Peter Xu
Hugh Dickins Jan. 24, 2022, 6:51 a.m. UTC | #16
On Fri, 21 Jan 2022, Peter Xu wrote:
> 
> Oh, one more thing..
> 
> When reading the history and also your explanations above, I noticed one thing
> that may not have been right for a long time: the zero page handling when zapping.
> 
> To quote your comment above, we should keep the zero page entries too if
> zap_details.zap_mapping is specified.  However, that hasn't been true for a
> long time, I guess starting from when vm_normal_page() began returning NULL
> for zero pfns.  I also have a very strong feeling that in the old world the
> "page*" was non-NULL for zero pages here.
> 
> So... I'm boldly wondering whether we also need another fix for the zero page
> handling here, probably before this whole patchset (so it'll be the 1st patch,
> and it should apply directly to master), because if I'm right about the above
> it can be seen as another separate bug fix:
> 
> ---8<---
> diff --git a/mm/memory.c b/mm/memory.c
> index f306e698a1e3..9b8348746e0b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1320,11 +1320,18 @@ struct zap_details {
>  static inline bool
>  zap_skip_check_mapping(struct zap_details *details, struct page *page)
>  {
> -       if (!details || !page)
> +       /* No detail pointer or no zap_mapping pointer, zap all */
> +       if (!details || !details->zap_mapping)
>                 return false;
>  
> -       return details->zap_mapping &&
> -               (details->zap_mapping != page_rmapping(page));
> +       /*
> +        * For example, the zero page.  If the user wants to keep the private
> +        * pages, zero page should also be in count.
> +        */
> +       if (!page)
> +               return true;
> +
> +       return details->zap_mapping != page_rmapping(page);
>  }
>  
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
> ---8<---
> 
> page can be NULL for e.g. PFNMAP, and also when an error occurred above.  I
> assume we don't need to bother with those (e.g., I don't think PFNMAP or MIXED
> should specify even_cows=false at all, because there's no page cache layer),
> though.  Mostly it's about how we should handle the zero page right.

I have not understood the above.

I don't know of any problem that needs fixing with the zero page:
how do you suppose the zero page gets into a truncatable or hole-punchable
mapping?  We use it for read faults in anonymous mappings.  And I told the
story of how once-upon-a-time it could get inserted into any mapping by
reading from /dev/zero, but that odd case was dropped years ago.  And I
am open to (even encouraging) a change to make use of zero page for read
faults of holes in shmem: but that's potential future work, which would
require some changes elsewhere (though perhaps none here: the zero page
could never be used for the result of a COW).

Please explain the zero page problem you hope to fix here.

Hugh
Peter Xu Jan. 24, 2022, 8:54 a.m. UTC | #17
On Sun, Jan 23, 2022 at 10:29:40PM -0800, Hugh Dickins wrote:
> > However, as I stated before, all these use cases always have another step to
> > take the lock and redo the range.  Then even if some migration entry got
> > wrongly skipped it'll always be fixed.  What we need to find is some caller
> > that calls zap_pte_range() without later taking the page lock and redo that.
> > That's the only possibility to trigger a real issue on the shmem accounting.
> 
> I agree that the fallback "if (folio_mapped() unmap_mapping_folio()",
> while holding folio lock, ensures that there cannot be a migration entry
> substituted for present pte at that time, so no problem if migration entry
> was wrongly skipped on the earlier unlocked pass.
> 
> But you're forgetting the complementary mistake: that the earlier unlocked
> pass might have zapped a migration entry (corresponding to an anon COWed
> page) when it should have skipped it (while punching a hole).

IMHO we won't wrongly zap a migration entry because when it's file backed we've
got non-NULL zap_details, so we'll skip all migration entries.  IOW, we can
only wrongly skip some entries, not wrongly zap some.

But I get your point, and thanks for pointing out what I missed - I think I
somehow forgot the private mappings completely when writing that up..

I have a quick idea on reproducer now (perhaps file size shrinking on private
pages being swapped out), I'll try to write a real reproducer and update later.

[...]

> I did not understand what you were asking there; but in your followup
> mail, I think you came to understand what I meant better.  Yes, I
> believe you could safely replace struct address_space *zap_mapping
> by a more understandable boolean (skip_cows? its inverse would be
> easier to understand, but we don't want almost everyone to have to
> pass a zap_details initialized to true there).

The only even_cows==true case for zap_details is with unmap_mapping_range(),
where its caller passes even_cows==true as a parameter.  So IMHO that helper
helps to construct the zap_details anyway.

I'll try it out starting with naming it zap_details.even_cows; I'll make it the
last patch as a cleanup.
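
Roughly something like below, I suppose (a sketch only; the final field and
helper names may well differ):

---8<---
struct zap_details {
	struct folio *single_folio;	/* Locked folio to be zapped */
	bool even_cows;			/* Zap anon COWed pages too? */
};

/* Whether anon COWed pages (and their swap/migration entries) go too */
static inline bool should_zap_cows(struct zap_details *details)
{
	/* By default, zap all pages, private COWs included */
	if (!details)
		return true;
	/* Hole-punching and cache invalidation pass even_cows == false */
	return details->even_cows;
}
---8<---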

> > > > > > >  			rss[mm_counter(page)]--;
> > > > > > >  		}
> > > 
> > > I have given no thought as to whether more "else"s are needed there.
> > 
> > It's hwpoison that's in the else.  Nothing else should.
> > 
> > I didn't mention it probably because I forgot.  I did think about it when
> > drafting, and IMHO we should simply zap that hwpoison entry because:
> > 
> >   (1) Zap means the user knows this data is meaningless, so at least we
> >       shouldn't SIGBUS for that anymore.
> > 
> >   (2) If we keep it there, it could erroneously trigger SIGBUS later if the
> >       guest accesses that pte again somehow.
> > 
> > I plan to mention that in the commit message, but I can also add some comments
> > directly into the code.  Let me know your thoughts.
> 
> It's comes down, again, to what punching a hole in a file should do
> to the private data that has been COWed from it.  Strictly, it should
> not interfere with it, even when the COWed page has become poisonous:
> the entry should be left in to generate SIGBUS.  Whereas ordinary
> unmapping or truncating or MADV_DONTNEEDing would zap it.

Makes sense, I'll take care of that in the new version too.  Thanks,
Peter Xu Jan. 24, 2022, 9:13 a.m. UTC | #18
On Sun, Jan 23, 2022 at 10:51:57PM -0800, Hugh Dickins wrote:
> I don't know of any problem that needs fixing with the zero page:
> how do you suppose the zero page gets into a truncatable or hole-punchable
> mapping?  We use it for read faults in anonymous mappings.  And I told the
> story of how once-upon-a-time it could get inserted into any mapping by
> reading from /dev/zero, but that odd case was dropped years ago.  And I
> am open to (even encouraging) a change to make use of zero page for read
> faults of holes in shmem: but that's potential future work, which would
> require some changes elsewhere (though perhaps none here: the zero page
> could never be used for the result of a COW).
> 
> Please explain the zero page problem you hope to fix here.

After trying to learn the old/new worlds of the zero page I somehow thought
there could be zero pfns installed here, but it seems that's not the case at all..

Please ignore above, sorry for the noise.
Peter Xu Jan. 24, 2022, 11:01 a.m. UTC | #19
On Mon, Jan 24, 2022 at 04:54:29PM +0800, Peter Xu wrote:
> I have a quick idea on reproducer now (perhaps file size shrinking on private
> pages being swapped out), I'll try to write a real reproducer and update later.

Here's the reproducer..

===8<===
#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

int page_size;
int shmem_fd;
char *buffer;

int main(void)
{
	int ret;
	char val;

	page_size = getpagesize();
	shmem_fd = memfd_create("test", 0);
	assert(shmem_fd >= 0);

	ret = ftruncate(shmem_fd, page_size * 2);
	assert(ret == 0);

	buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
			MAP_PRIVATE, shmem_fd, 0);
	assert(buffer != MAP_FAILED);

	/* Write private page, swap it out */
	buffer[page_size] = 1;
	madvise(buffer, page_size * 2, MADV_PAGEOUT);

	/* This should drop private buffer[page_size] already */
	ret = ftruncate(shmem_fd, page_size);
	assert(ret == 0);
	/* Recover the size */
	ret = ftruncate(shmem_fd, page_size * 2);
	assert(ret == 0);

	/* Re-read the data, it should be all zero */
	val = buffer[page_size];
	if (val == 0)
		printf("Good\n");
	else
		printf("BUG\n");
}
===8<===

I'm looking for a correct Fixes tag for the 1st patch, but afaict it seems
wrong even in the initial git commit, which traces back to your commit in the
"history" git repo:

dd9fd0e03de6 ("[PATCH] rmap: nonlinear truncation", 2004-04-17)

Where we have:

		/*
		 * If details->check_mapping, we leave swap entries;
		 * if details->nonlinear_vma, we leave file entries.
		 */
		if (unlikely(details))
			continue;
		if (!pte_file(ptent))
			free_swap_and_cache(pte_to_swp_entry(ptent));

I don't see how it could be right there if the user only specified
details->nonlinear_vma while keeping details->check_mapping==NULL.. and
there are definitely callers doing that, e.g.:

static long madvise_dontneed(struct vm_area_struct * vma,
			     unsigned long start, unsigned long end)
{
	if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma))
		return -EINVAL;

	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
		struct zap_details details = {
			.nonlinear_vma = vma,
			.last_index = ULONG_MAX,
		};
		zap_page_range(vma, start, end - start, &details);
	} else
		zap_page_range(vma, start, end - start, NULL);
	return 0;
}

So I plan to add the Fixes to the initial git commit, 1da177e4c3f4
("Linux-2.6.12-rc2", 2005-04-16).

Does it sound right to you?

Thanks,

--
Peter Xu
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..e454f3c6aeb9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1382,16 +1382,14 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			continue;
 		}
 
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
-			continue;
-
 		if (!non_swap_entry(entry))
 			rss[MM_SWAPENTS]--;
 		else if (is_migration_entry(entry)) {
 			struct page *page;
 
 			page = pfn_swap_entry_to_page(entry);
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			rss[mm_counter(page)]--;
 		}
 		if (unlikely(!free_swap_and_cache(entry)))