| Message ID | 20221123181838.1373440-1-hannes@cmpxchg.org (mailing list archive) |
|---|---|
| State | New |
| Series | mm: remove lock_page_memcg() from rmap |
On Wed, Nov 23, 2022 at 10:18 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> rmap changes (mapping and unmapping) of a page currently take
> lock_page_memcg() to serialize 1) update of the mapcount and the
> cgroup mapped counter with 2) cgroup moving the page and updating the
> old cgroup and the new cgroup counters based on page_mapped().
>
> Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
> offlined groups"), we used to reassign all pages that could be found
> on a cgroup's LRU list on deletion - something that rmap didn't
> naturally serialize against. Since that commit, however, the only
> pages that get moved are those mapped into page tables of a task
> that's being migrated. In that case, the pte lock is always held (and
> we know the page is mapped), which keeps rmap changes at bay already.
>
> The additional lock_page_memcg() by rmap is redundant. Remove it.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

On Wed, 23 Nov 2022, Johannes Weiner wrote:

> rmap changes (mapping and unmapping) of a page currently take
> lock_page_memcg() to serialize 1) update of the mapcount and the
> cgroup mapped counter with 2) cgroup moving the page and updating the
> old cgroup and the new cgroup counters based on page_mapped().
>
> Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
> offlined groups"), we used to reassign all pages that could be found
> on a cgroup's LRU list on deletion - something that rmap didn't
> naturally serialize against. Since that commit, however, the only
> pages that get moved are those mapped into page tables of a task
> that's being migrated. In that case, the pte lock is always held (and
> we know the page is mapped), which keeps rmap changes at bay already.
>
> The additional lock_page_memcg() by rmap is redundant. Remove it.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Thank you, I love it: but with sorrow and shame, NAK to this version.

I was gearing up to rush in the crash fix at the bottom, when testing
showed that the new VM_WARN_ON_ONCE(!folio_mapped(folio)) actually hits.

So I've asked Stephen to drop this mm-unstable commit from -next for
tonight, while we think about what more is needed.

I was disbelieving when I saw the warning, couldn't understand at all.
But a look at get_mctgt_type() shatters my illusion: it doesn't just
return a page for pte_present(ptent), it goes off looking up swap
cache and page cache; plus I've no idea whether an MC_TARGET_DEVICE
page would appear as folio_mapped() or not.

Does that mean that we just have to reinstate the folio_mapped() checks
in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
commit? Or does it invalidate the whole project to remove
lock_page_memcg() from mm/rmap.c?

Too disappointed to think about it more tonight :-(

Hugh

> ---
>  mm/memcontrol.c | 35 ++++++++++++++++++++---------------
>  mm/rmap.c       | 12 ------------
>  2 files changed, 20 insertions(+), 27 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 23750cec0036..52b86ca7a78e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5676,7 +5676,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
>   * @from: mem_cgroup which the page is moved from.
>   * @to: mem_cgroup which the page is moved to. @from != @to.
>   *
> - * The caller must make sure the page is not on LRU (isolate_page() is useful.)
> + * This function acquires folio_lock() and folio_lock_memcg(). The
> + * caller must exclude all other possible ways of accessing
> + * page->memcg, such as LRU isolation (to lock out isolation) and
> + * having the page mapped and pte-locked (to lock out rmap).
>   *
>   * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
>   * from old cgroup.
> @@ -5696,6 +5699,13 @@ static int mem_cgroup_move_account(struct page *page,
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>  	VM_BUG_ON(compound && !folio_test_large(folio));
>
> +	/*
> +	 * We're only moving pages mapped into the moving process's
> +	 * page tables. The caller's pte lock prevents rmap from
> +	 * removing the NR_x_MAPPED state while we transfer it.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_mapped(folio));
> +
>  	/*
>  	 * Prevent mem_cgroup_migrate() from looking at
>  	 * page's memory cgroup of its source page while we change it.
> @@ -5715,30 +5725,25 @@ static int mem_cgroup_move_account(struct page *page,
>  	folio_memcg_lock(folio);
>
>  	if (folio_test_anon(folio)) {
> -		if (folio_mapped(folio)) {
> -			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
> -			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
> -			if (folio_test_transhuge(folio)) {
> -				__mod_lruvec_state(from_vec, NR_ANON_THPS,
> -						   -nr_pages);
> -				__mod_lruvec_state(to_vec, NR_ANON_THPS,
> -						   nr_pages);
> -			}
> +		__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
> +		__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
> +
> +		if (folio_test_transhuge(folio)) {
> +			__mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages);
> +			__mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages);
>  		}
>  	} else {
>  		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
>  		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
>
> +		__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
> +		__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
> +
>  		if (folio_test_swapbacked(folio)) {
>  			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
>  			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
>  		}
>
> -		if (folio_mapped(folio)) {
> -			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
> -			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
> -		}
> -
>  		if (folio_test_dirty(folio)) {
>  			struct address_space *mapping = folio_mapping(folio);
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 459dc1c44d8a..11a4894158db 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1222,9 +1222,6 @@ void page_add_anon_rmap(struct page *page,
>  	bool compound = flags & RMAP_COMPOUND;
>  	bool first = true;
>
> -	if (unlikely(PageKsm(page)))
> -		lock_page_memcg(page);
> -
>  	/* Is page being mapped by PTE? Is this its first map to be added? */
>  	if (likely(!compound)) {
>  		first = atomic_inc_and_test(&page->_mapcount);
> @@ -1254,9 +1251,6 @@ void page_add_anon_rmap(struct page *page,
>  	if (nr)
>  		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
>
> -	if (unlikely(PageKsm(page)))
> -		unlock_page_memcg(page);
> -
>  	/* address might be in next vma when migration races vma_adjust */
>  	else if (first)
>  		__page_set_anon_rmap(page, vma, address,
> @@ -1321,7 +1315,6 @@ void page_add_file_rmap(struct page *page,
>  	bool first;
>
>  	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
> -	lock_page_memcg(page);
>
>  	/* Is page being mapped by PTE? Is this its first map to be added? */
>  	if (likely(!compound)) {
> @@ -1349,7 +1342,6 @@ void page_add_file_rmap(struct page *page,
>  				NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
>  	if (nr)
>  		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
> -	unlock_page_memcg(page);
>
>  	mlock_vma_page(page, vma, compound);
>  }
> @@ -1378,8 +1370,6 @@ void page_remove_rmap(struct page *page,
>  		return;
>  	}
>
> -	lock_page_memcg(page);
> -
>  	/* Is page being unmapped by PTE? Is this its last map to be removed? */
>  	if (likely(!compound)) {
>  		last = atomic_add_negative(-1, &page->_mapcount);
> @@ -1427,8 +1417,6 @@ void page_remove_rmap(struct page *page,
>  	 * and remember that it's only reliable while mapped.
>  	 */
>
> -	unlock_page_memcg(page);
> -
>  	munlock_vma_page(page, vma, compound);
>  }
>
> --
> 2.38.1

[PATCH] mm: remove lock_page_memcg() from rmap - fix

Blame me for the hidden "else", which now does the wrong thing, leaving
page's anon_vma unset, then VM_BUG_ON before do_swap_page's set_pte_at.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/rmap.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 11a4894158db..5a8d27fdc644 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1251,13 +1251,14 @@ void page_add_anon_rmap(struct page *page,
 	if (nr)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 
-	/* address might be in next vma when migration races vma_adjust */
-	else if (first)
-		__page_set_anon_rmap(page, vma, address,
-				     !!(flags & RMAP_EXCLUSIVE));
-	else
-		__page_check_anon_rmap(page, vma, address);
-
+	if (!PageKsm(page)) {
+		/* address may be in next vma if migration races vma_adjust */
+		if (first)
+			__page_set_anon_rmap(page, vma, address,
+					     !!(flags & RMAP_EXCLUSIVE));
+		else
+			__page_check_anon_rmap(page, vma, address);
+	}
 	mlock_vma_page(page, vma, compound);
 }

On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> On Wed, 23 Nov 2022, Johannes Weiner wrote:
> > rmap changes (mapping and unmapping) of a page currently take
> > lock_page_memcg() to serialize 1) update of the mapcount and the
> > cgroup mapped counter with 2) cgroup moving the page and updating the
> > old cgroup and the new cgroup counters based on page_mapped().
> >
> > Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
> > offlined groups"), we used to reassign all pages that could be found
> > on a cgroup's LRU list on deletion - something that rmap didn't
> > naturally serialize against. Since that commit, however, the only
> > pages that get moved are those mapped into page tables of a task
> > that's being migrated. In that case, the pte lock is always held (and
> > we know the page is mapped), which keeps rmap changes at bay already.
> >
> > The additional lock_page_memcg() by rmap is redundant. Remove it.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Thank you, I love it: but with sorrow and shame, NAK to this version.
>
> I was gearing up to rush in the crash fix at the bottom, when testing
> showed that the new VM_WARN_ON_ONCE(!folio_mapped(folio)) actually hits.
>
> So I've asked Stephen to drop this mm-unstable commit from -next for
> tonight, while we think about what more is needed.
>
> I was disbelieving when I saw the warning, couldn't understand at all.
> But a look at get_mctgt_type() shatters my illusion: it doesn't just
> return a page for pte_present(ptent), it goes off looking up swap
> cache and page cache; plus I've no idea whether an MC_TARGET_DEVICE
> page would appear as folio_mapped() or not.

Thanks for catching this.

A device_private pte always has a matching mapcount in the fake page
it points to, so we should be good here. Those pages migrate back and
forth between system and device memory, relying on the migration
code's unmap and restore bits. Hence they participate in regular rmap.

The swapcache/pagecache bit was a brainfart. We acquire the folio lock
in move_account(), which would lock out concurrent faults. If it's not
mapped, I don't see how it could become mapped behind our backs. But
we do need to be prepared for it to be unmapped.

> Does that mean that we just have to reinstate the folio_mapped() checks
> in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> commit? Or does it invalidate the whole project to remove
> lock_page_memcg() from mm/rmap.c?

I think we have to bring back the folio_mapped() conditional and
update the comments. But it shouldn't invalidate the whole project.
I'll triple check this, however.

Thanks

On Mon, Nov 28, 2022 at 11:59:53AM -0500, Johannes Weiner wrote:
> On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> The swapcache/pagecache bit was a brainfart. We acquire the folio lock
> in move_account(), which would lock out concurrent faults. If it's not
> mapped, I don't see how it could become mapped behind our backs. But
> we do need to be prepared for it to be unmapped.

Welp, that doesn't protect us from the inverse, where the page is
mapped elsewhere and the other ptes are going away. So this won't be
enough, unfortunately.

> > Does that mean that we just have to reinstate the folio_mapped() checks
> > in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> > commit? Or does it invalidate the whole project to remove
> > lock_page_memcg() from mm/rmap.c?

Short of further restricting the pages that can be moved, I don't see
how we can get rid of the cgroup locks in rmap after all. :(

We can try limiting move candidates to present ptes. But maybe it's
indeed time to deprecate the legacy charge moving altogether, and get
rid of the entire complication.

Hugh, Shakeel, Michal, what do you think?

On Tue, Nov 29, 2022 at 11:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We can try limiting move candidates to present ptes. But maybe it's
> indeed time to deprecate the legacy charge moving altogether, and get
> rid of the entire complication.

Please. If that's what it takes..

Linus

On Tue, Nov 29, 2022 at 11:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Nov 28, 2022 at 11:59:53AM -0500, Johannes Weiner wrote:
> > On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> > The swapcache/pagecache bit was a brainfart. We acquire the folio lock
> > in move_account(), which would lock out concurrent faults. If it's not
> > mapped, I don't see how it could become mapped behind our backs. But
> > we do need to be prepared for it to be unmapped.
>
> Welp, that doesn't protect us from the inverse, where the page is
> mapped elsewhere and the other ptes are going away. So this won't be
> enough, unfortunately.
>
> > > Does that mean that we just have to reinstate the folio_mapped() checks
> > > in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> > > commit? Or does it invalidate the whole project to remove
> > > lock_page_memcg() from mm/rmap.c?
>
> Short of further restricting the pages that can be moved, I don't see
> how we can get rid of the cgroup locks in rmap after all. :(
>
> We can try limiting move candidates to present ptes. But maybe it's
> indeed time to deprecate the legacy charge moving altogether, and get
> rid of the entire complication.
>
> Hugh, Shakeel, Michal, what do you think?

I am on-board.

On Tue, 29 Nov 2022, Johannes Weiner wrote:
> On Mon, Nov 28, 2022 at 11:59:53AM -0500, Johannes Weiner wrote:
> > On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> > The swapcache/pagecache bit was a brainfart. We acquire the folio lock
> > in move_account(), which would lock out concurrent faults. If it's not
> > mapped, I don't see how it could become mapped behind our backs. But
> > we do need to be prepared for it to be unmapped.
>
> Welp, that doesn't protect us from the inverse, where the page is
> mapped elsewhere and the other ptes are going away. So this won't be
> enough, unfortunately.
>
> > > Does that mean that we just have to reinstate the folio_mapped() checks
> > > in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> > > commit? Or does it invalidate the whole project to remove
> > > lock_page_memcg() from mm/rmap.c?
>
> Short of further restricting the pages that can be moved, I don't see
> how we can get rid of the cgroup locks in rmap after all. :(
>
> We can try limiting move candidates to present ptes. But maybe it's
> indeed time to deprecate the legacy charge moving altogether, and get
> rid of the entire complication.
>
> Hugh, Shakeel, Michal, what do you think?

I'm certainly not against deprecating it - it's a largish body of odd
code, which poses significant problems, yet is very seldom used; but I
feel that we'd all like to see it gone from rmap quicker than it can
be fully deprecated out of existence.

I do wonder if any user would notice, if we quietly removed its
operation on non-present ptes; certainly there *might* be users
relying on that behaviour, but I doubt that many would.

Alternatively (although I think Linus's objection to it in rmap is on
both aesthetic and performance grounds, and retaining any trace of it
in rmap.c still fails the aesthetic), can there be some static-keying
done, to eliminate (un)lock_page_memcg() overhead for all but those few
who actually indulge in moving memcg charge at immigrate? (But I think
you would have already done that if it were possible.)

Hugh

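The static-keying idea would look roughly like the sketch below. This is an
illustration only, not code from the thread, and it glosses over the hard
part of flipping the key safely while rmap callers may already be between a
lock and its matching unlock; the key and wrapper names are made up.

/* Hypothetical sketch of the static-key idea; not code from this thread. */
#include <linux/jump_label.h>
#include <linux/memcontrol.h>

/* Enabled the first time any cgroup turns on move_charge_at_immigrate. */
DEFINE_STATIC_KEY_FALSE(memcg_charge_moving_key);

static inline void rmap_lock_page_memcg(struct page *page)
{
	/* Stays a no-op branch for everyone who never moves charges. */
	if (static_branch_unlikely(&memcg_charge_moving_key))
		lock_page_memcg(page);
}

static inline void rmap_unlock_page_memcg(struct page *page)
{
	if (static_branch_unlikely(&memcg_charge_moving_key))
		unlock_page_memcg(page);
}

/*
 * The move_charge_at_immigrate write handler would call
 * static_branch_enable(&memcg_charge_moving_key) for a non-zero val.
 * Enabling the key while some rmap caller sits between its lock and
 * unlock would unbalance the pair, which is one reason this is not as
 * simple as it looks.
 */
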
On Tue 29-11-22 14:08:34, Johannes Weiner wrote:
[...]
> Short of further restricting the pages that can be moved, I don't see
> how we can get rid of the cgroup locks in rmap after all. :(

:(

> We can try limiting move candidates to present ptes. But maybe it's
> indeed time to deprecate the legacy charge moving altogether, and get
> rid of the entire complication.
>
> Hugh, Shakeel, Michal, what do you think?

I do agree with Hugh that going part way will likely not solve the
general maintenance burden. I am OK with dropping the functionality,
but we should at least add a big fat warning if anybody wants to
enable memory.move_charge_at_immigrate. If there are any workloads
which absolutely depend on the feature (which would make them
impossible to migrate to v2, btw) then we want to hear about that and
address it in some way.

With an incremental deprecation that only limits moving to present
ptes, the semantic change might get overlooked for an extended time.

On Tue, Nov 29, 2022 at 11:33 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Tue, 29 Nov 2022, Johannes Weiner wrote:
> > On Mon, Nov 28, 2022 at 11:59:53AM -0500, Johannes Weiner wrote:
> > > On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> > > The swapcache/pagecache bit was a brainfart. We acquire the folio lock
> > > in move_account(), which would lock out concurrent faults. If it's not
> > > mapped, I don't see how it could become mapped behind our backs. But
> > > we do need to be prepared for it to be unmapped.
> >
> > Welp, that doesn't protect us from the inverse, where the page is
> > mapped elsewhere and the other ptes are going away. So this won't be
> > enough, unfortunately.
> >
> > > > Does that mean that we just have to reinstate the folio_mapped() checks
> > > > in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> > > > commit? Or does it invalidate the whole project to remove
> > > > lock_page_memcg() from mm/rmap.c?
> >
> > Short of further restricting the pages that can be moved, I don't see
> > how we can get rid of the cgroup locks in rmap after all. :(
> >
> > We can try limiting move candidates to present ptes. But maybe it's
> > indeed time to deprecate the legacy charge moving altogether, and get
> > rid of the entire complication.
> >
> > Hugh, Shakeel, Michal, what do you think?
>
> I'm certainly not against deprecating it - it's a largish body of odd
> code, which poses significant problems, yet is very seldom used; but I
> feel that we'd all like to see it gone from rmap quicker than it can
> be fully deprecated out of existence.
>
> I do wonder if any user would notice, if we quietly removed its
> operation on non-present ptes; certainly there *might* be users
> relying on that behaviour, but I doubt that many would.
>
> Alternatively (although I think Linus's objection to it in rmap is on
> both aesthetic and performance grounds, and retaining any trace of it
> in rmap.c still fails the aesthetic), can there be some static-keying
> done, to eliminate (un)lock_page_memcg() overhead for all but those few
> who actually indulge in moving memcg charge at immigrate? (But I think
> you would have already done that if it were possible.)
>

My preference would be going with the removal of non-present ptes over
static-key in [un]lock_page_memcg(). How about the following steps:

1. Add warning in memory.move_charge_at_immigrate now (6.1/6.2) that
this is going away and also backport it to the stable kernels.

2. For 6.2 (or 6.3), remove the non-present pte migration with some
additional text in the warning and do the rmap cleanup.

3. After 3 or 4 releases (and hopefully finding no real users), we
deprecate this completely.

Step 3 can be delayed if there are some users depending on it. However
we need to be firm that this is going away irrespective.

Shakeel

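A rough sketch of what step 1 could look like, assuming the warning lands in
the cgroup v1 write handler for move_charge_at_immigrate; the handler body
is shown in simplified form and the message text is invented for
illustration.

/*
 * Rough sketch of step 1, not a patch from this thread: warn once when
 * userspace enables charge moving.
 */
static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
					struct cftype *cft, u64 val)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);

	if (val & ~MOVE_MASK)
		return -EINVAL;

	pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
		     "Please report to linux-mm@kvack.org if you depend on it.\n");

	memcg->move_charge_at_immigrate = val;
	return 0;
}
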
On Wed, 30 Nov 2022, Shakeel Butt wrote:
>
> 2. For 6.2 (or 6.3), remove the non-present pte migration with some
> additional text in the warning and do the rmap cleanup.

I just had an idea for softening the impact of that change: a moment's
more thought may prove it's a terrible idea, but right now I like it.

What if we keep the non-present pte migration throughout the deprecation
period, but with a change to where the folio_trylock() is done, and
a refusal to move the charge on the page of a non-present pte, if that
page/folio is currently mapped anywhere else - the folio lock preventing
it from then becoming mapped while in mem_cgroup_move_account().

There's an argument that that's a better implementation anyway: that
we should not interfere with others' pages; but perhaps it would turn
out to be unimplementable, or would make for less predictable behaviour.

Hugh

On Wed, Nov 30, 2022 at 09:36:15AM -0800, Hugh Dickins wrote:
> On Wed, 30 Nov 2022, Shakeel Butt wrote:
>
> > 2. For 6.2 (or 6.3), remove the non-present pte migration with some
> > additional text in the warning and do the rmap cleanup.
>
> I just had an idea for softening the impact of that change: a moment's
> more thought may prove it's a terrible idea, but right now I like it.
>
> What if we keep the non-present pte migration throughout the deprecation
> period, but with a change to where the folio_trylock() is done, and
> a refusal to move the charge on the page of a non-present pte, if that
> page/folio is currently mapped anywhere else - the folio lock preventing
> it from then becoming mapped while in mem_cgroup_move_account().

I would like that better too. Charge moving has always been lossy
(because of trylocking the page, and having to isolate it), but
categorically leaving private swap pages behind seems like a bit much
to sneak in quietly.

> There's an argument that that's a better implementation anyway: that
> we should not interfere with others' pages; but perhaps it would turn
> out to be unimplementable, or would make for less predictable behaviour.

Hm, I think the below should work for swap pages. Do you see anything
obviously wrong with it, or scenarios I haven't considered?

@@ -5637,6 +5645,46 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
	 * we call find_get_page() with swapper_space directly.
	 */
	page = find_get_page(swap_address_space(ent), swp_offset(ent));
+
+	/*
+	 * Don't move shared charges. This isn't just for saner move
+	 * semantics, it also ensures that page_mapped() is stable for
+	 * the accounting in mem_cgroup_mapcount().
+	 *
+	 * We have to serialize against the following paths: fork
+	 * (which may copy a page map or a swap pte), fault (which may
+	 * change a swap pte into a page map), unmap (which may cause
+	 * a page map or a swap pte to disappear), and reclaim (which
+	 * may change a page map into a swap pte).
+	 *
+	 * - Without swapcache, we only want to move the charge if
+	 *   there are no other swap ptes. With the pte lock, the
+	 *   swapcount is stable against all of the above scenarios
+	 *   when it's 1 (our pte), which is the case we care about.
+	 *
+	 * - When there is a page in swapcache, we only want to move
+	 *   charges when neither the page nor the swap entry are
+	 *   mapped elsewhere. The pte lock prevents our pte from
+	 *   being forked or unmapped. The page lock will stop faults
+	 *   against, and reclaim of, the swapcache page. So if the
+	 *   page isn't mapped, and the swap count is 1 (our pte), the
+	 *   test results are stable and the charge is exclusive.
+	 */
+	if (!page && __swap_count(ent) != 1)
+		return NULL;
+
+	if (page) {
+		if (!trylock_page(page)) {
+			put_page(page);
+			return NULL;
+		}
+		if (page_mapped(page) || __swap_count(ent) != 1) {
+			unlock_page(page);
+			put_page(page);
+			return NULL;
+		}
+	}
+
	entry->val = ent.val;

	return page;

On Wed, 30 Nov 2022, Johannes Weiner wrote:
>
> Hm, I think the below should work for swap pages. Do you see anything
> obviously wrong with it, or scenarios I haven't considered?
>

I think you're overcomplicating it, with the __swap_count(ent) business,
and consequent unnecessarily detailed comments on the serialization.

Page/folio lock prevents a !page_mapped(page) becoming a page_mapped(page),
whether it's in swap cache or in file cache; it does not stop the sharing
count going further up, or down even to 0, but we just don't need to worry
about that sharing count - the MC_TARGET_PAGE case does not reject pages
with mapcount > 1, so why complicate the swap or file case in that way?

(Yes, it can be argued that all such sharing should be rejected; but we
didn't come here to argue improvements to memcg charge moving semantics:
just to minimize its effect on rmap, before it is fully deprecated.)

Or am I missing the point of why you add that complication?

> @@ -5637,6 +5645,46 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,

Don't forget to trylock the page in the device_private case before this.

> 	 * we call find_get_page() with swapper_space directly.
> 	 */
> 	page = find_get_page(swap_address_space(ent), swp_offset(ent));
> +
> +	/*
> +	 * Don't move shared charges. This isn't just for saner move
> +	 * semantics, it also ensures that page_mapped() is stable for
> +	 * the accounting in mem_cgroup_mapcount().

mem_cgroup_mapcount()??

> +	 *
> +	 * We have to serialize against the following paths: fork
> +	 * (which may copy a page map or a swap pte), fault (which may
> +	 * change a swap pte into a page map), unmap (which may cause
> +	 * a page map or a swap pte to disappear), and reclaim (which
> +	 * may change a page map into a swap pte).
> +	 *
> +	 * - Without swapcache, we only want to move the charge if
> +	 *   there are no other swap ptes. With the pte lock, the
> +	 *   swapcount is stable against all of the above scenarios
> +	 *   when it's 1 (our pte), which is the case we care about.
> +	 *
> +	 * - When there is a page in swapcache, we only want to move
> +	 *   charges when neither the page nor the swap entry are
> +	 *   mapped elsewhere. The pte lock prevents our pte from
> +	 *   being forked or unmapped. The page lock will stop faults
> +	 *   against, and reclaim of, the swapcache page. So if the
> +	 *   page isn't mapped, and the swap count is 1 (our pte), the
> +	 *   test results are stable and the charge is exclusive.
> +	 */
> +	if (!page && __swap_count(ent) != 1)
> +		return NULL;
> +
> +	if (page) {
> +		if (!trylock_page(page)) {
> +			put_page(page);
> +			return NULL;
> +		}
> +		if (page_mapped(page) || __swap_count(ent) != 1) {
> +			unlock_page(page);
> +			put_page(page);
> +			return NULL;
> +		}
> +	}
> +
> 	entry->val = ent.val;
>
> 	return page;

Looks right, without the __swap_count() additions and swap count comments.

And similar code in mc_handle_file_pte() - or are you saying that only
swap should be handled this way? I would disagree.

And matching trylock in mc_handle_present_pte() (and get_mctgt_type_thp()),
instead of in mem_cgroup_move_account().

I haven't checked to see where the page then needs to be unlocked,
probably some new places.

And I don't know what will be best for the preliminary precharge pass:
doesn't really want the page lock at all, but it may be unnecessary
complication to avoid taking it then unlocking it in that pass.

Hugh

On Wed, Nov 30, 2022 at 04:13:23PM -0800, Hugh Dickins wrote:
> On Wed, 30 Nov 2022, Johannes Weiner wrote:
> >
> > Hm, I think the below should work for swap pages. Do you see anything
> > obviously wrong with it, or scenarios I haven't considered?
>
> I think you're overcomplicating it, with the __swap_count(ent) business,
> and consequent unnecessarily detailed comments on the serialization.
>
> Page/folio lock prevents a !page_mapped(page) becoming a page_mapped(page),
> whether it's in swap cache or in file cache; it does not stop the sharing
> count going further up, or down even to 0, but we just don't need to worry
> about that sharing count - the MC_TARGET_PAGE case does not reject pages
> with mapcount > 1, so why complicate the swap or file case in that way?
>
> (Yes, it can be argued that all such sharing should be rejected; but we
> didn't come here to argue improvements to memcg charge moving semantics:
> just to minimize its effect on rmap, before it is fully deprecated.)
>
> Or am I missing the point of why you add that complication?

No, it just seemed odd to move shared swap *unless* it's partially
faulted. But you're right, it's probably not worth the hassle. I'll
cut this down to the page_mapped() check.

The struggle of writing code for Schroedinger's User...

> > @@ -5637,6 +5645,46 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
>
> Don't forget to trylock the page in the device_private case before this.

Yep, thanks!

> > 	 * we call find_get_page() with swapper_space directly.
> > 	 */
> > 	page = find_get_page(swap_address_space(ent), swp_offset(ent));
> > +
> > +	/*
> > +	 * Don't move shared charges. This isn't just for saner move
> > +	 * semantics, it also ensures that page_mapped() is stable for
> > +	 * the accounting in mem_cgroup_mapcount().
>
> mem_cgroup_mapcount()??

mem_cgroup_move_account() of course! Will fix.

> > +	 * We have to serialize against the following paths: fork
> > +	 * (which may copy a page map or a swap pte), fault (which may
> > +	 * change a swap pte into a page map), unmap (which may cause
> > +	 * a page map or a swap pte to disappear), and reclaim (which
> > +	 * may change a page map into a swap pte).
> > +	 *
> > +	 * - Without swapcache, we only want to move the charge if
> > +	 *   there are no other swap ptes. With the pte lock, the
> > +	 *   swapcount is stable against all of the above scenarios
> > +	 *   when it's 1 (our pte), which is the case we care about.
> > +	 *
> > +	 * - When there is a page in swapcache, we only want to move
> > +	 *   charges when neither the page nor the swap entry are
> > +	 *   mapped elsewhere. The pte lock prevents our pte from
> > +	 *   being forked or unmapped. The page lock will stop faults
> > +	 *   against, and reclaim of, the swapcache page. So if the
> > +	 *   page isn't mapped, and the swap count is 1 (our pte), the
> > +	 *   test results are stable and the charge is exclusive.

... and edit this down accordingly.

> > +	 */
> > +	if (!page && __swap_count(ent) != 1)
> > +		return NULL;
> > +
> > +	if (page) {
> > +		if (!trylock_page(page)) {
> > +			put_page(page);
> > +			return NULL;
> > +		}
> > +		if (page_mapped(page) || __swap_count(ent) != 1) {
> > +			unlock_page(page);
> > +			put_page(page);
> > +			return NULL;
> > +		}
> > +	}
> > +
> > 	entry->val = ent.val;
> >
> > 	return page;
>
> Looks right, without the __swap_count() additions and swap count comments.
>
> And similar code in mc_handle_file_pte() - or are you saying that only
> swap should be handled this way? I would disagree.

Right, same rules apply there. I only pasted the swap one to make sure
we get aligned on the basic strategy.

> And matching trylock in mc_handle_present_pte() (and get_mctgt_type_thp()),
> instead of in mem_cgroup_move_account().

Yes.

> I haven't checked to see where the page then needs to be unlocked,
> probably some new places.

Yes, the callers of get_mctgt_type*() need to unlock (if target is
passed and the page is returned). It looks straight-forward, they
already have to do put_page().

> And I don't know what will be best for the preliminary precharge pass:
> doesn't really want the page lock at all, but it may be unnecessary
> complication to avoid taking it then unlocking it in that pass.

We could make it conditional on target, which precharge doesn't pass,
but I agree it's likely not worth optimizing that code at this point.

Thanks for taking a look, Hugh, that's excellent input. I'll finish
this patch, rebase the rmap patch on it, and add a new one to issue a
deprecation warning in mem_cgroup_move_charge_write().

Johannes

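For reference, the cut-down tail of mc_handle_swap_pte() that this exchange
converges on might look roughly like the sketch below; this is a reading of
the plan, not the reposted patch: keep the trylock and the page_mapped()
test, drop the __swap_count() comparisons.

	/*
	 * Sketch only - not the reposted patch. The page lock keeps
	 * page_mapped() stable here, and pages that are mapped elsewhere
	 * are simply not moved.
	 */
	page = find_get_page(swap_address_space(ent), swp_offset(ent));
	if (page) {
		if (!trylock_page(page)) {
			put_page(page);
			return NULL;
		}
		if (page_mapped(page)) {
			unlock_page(page);
			put_page(page);
			return NULL;
		}
		/* page is returned locked; the caller unlocks it */
	}
	entry->val = ent.val;
	return page;
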
On Thu, 1 Dec 2022, Johannes Weiner wrote:
> On Wed, Nov 30, 2022 at 04:13:23PM -0800, Hugh Dickins wrote:
>
> > And I don't know what will be best for the preliminary precharge pass:
> > doesn't really want the page lock at all, but it may be unnecessary
> > complication to avoid taking it then unlocking it in that pass.
>
> We could make it conditional on target, which precharge doesn't pass,
> but I agree it's likely not worth optimizing that code at this point.

I hadn't noticed that NULL target so easily distinguishes that case:
unless it goes on to uglify things more (which I think it won't, seeing
now how there are already plenty of conditionals on target), I would
prefer to avoid the trylock and unlock in the precharge case; but
decide for yourself.

Hugh

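One possible shape for the target-conditional locking discussed above, as a
sketch only: a shared helper (hypothetical name) that the non-present-pte
handlers could call, skipping the page lock during the precharge pass, which
passes a NULL target.

/*
 * Hypothetical sketch, not code from this thread: a helper that
 * mc_handle_swap_pte() and mc_handle_file_pte() could share. target is
 * NULL during the precharge pass, so no page lock is taken there; during
 * the move pass the page comes back locked, or not at all.
 */
static struct page *mc_take_locked_page(struct page *page,
					union mc_target *target)
{
	if (!page || !target)
		return page;

	if (!trylock_page(page)) {
		put_page(page);
		return NULL;
	}
	if (page_mapped(page)) {	/* mapped elsewhere: leave it alone */
		unlock_page(page);
		put_page(page);
		return NULL;
	}
	return page;	/* caller unlocks after mem_cgroup_move_account() */
}

mc_handle_present_pte() would presumably want just the trylock part, since
the page there is by definition mapped by the pte being moved.
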
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23750cec0036..52b86ca7a78e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5676,7 +5676,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * @from: mem_cgroup which the page is moved from.
  * @to: mem_cgroup which the page is moved to. @from != @to.
  *
- * The caller must make sure the page is not on LRU (isolate_page() is useful.)
+ * This function acquires folio_lock() and folio_lock_memcg(). The
+ * caller must exclude all other possible ways of accessing
+ * page->memcg, such as LRU isolation (to lock out isolation) and
+ * having the page mapped and pte-locked (to lock out rmap).
  *
  * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  * from old cgroup.
@@ -5696,6 +5699,13 @@ static int mem_cgroup_move_account(struct page *page,
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON(compound && !folio_test_large(folio));
 
+	/*
+	 * We're only moving pages mapped into the moving process's
+	 * page tables. The caller's pte lock prevents rmap from
+	 * removing the NR_x_MAPPED state while we transfer it.
+	 */
+	VM_WARN_ON_ONCE(!folio_mapped(folio));
+
 	/*
 	 * Prevent mem_cgroup_migrate() from looking at
 	 * page's memory cgroup of its source page while we change it.
@@ -5715,30 +5725,25 @@ static int mem_cgroup_move_account(struct page *page,
 	folio_memcg_lock(folio);
 
 	if (folio_test_anon(folio)) {
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
-			if (folio_test_transhuge(folio)) {
-				__mod_lruvec_state(from_vec, NR_ANON_THPS,
-						   -nr_pages);
-				__mod_lruvec_state(to_vec, NR_ANON_THPS,
-						   nr_pages);
-			}
+		__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+
+		if (folio_test_transhuge(folio)) {
+			__mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages);
+			__mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages);
 		}
 	} else {
 		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
 		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
 
+		__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
+
 		if (folio_test_swapbacked(folio)) {
 			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
 			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
 		}
 
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
-		}
-
 		if (folio_test_dirty(folio)) {
 			struct address_space *mapping = folio_mapping(folio);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 459dc1c44d8a..11a4894158db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1222,9 +1222,6 @@ void page_add_anon_rmap(struct page *page,
 	bool compound = flags & RMAP_COMPOUND;
 	bool first = true;
 
-	if (unlikely(PageKsm(page)))
-		lock_page_memcg(page);
-
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
 		first = atomic_inc_and_test(&page->_mapcount);
@@ -1254,9 +1251,6 @@ void page_add_anon_rmap(struct page *page,
 	if (nr)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 
-	if (unlikely(PageKsm(page)))
-		unlock_page_memcg(page);
-
 	/* address might be in next vma when migration races vma_adjust */
 	else if (first)
 		__page_set_anon_rmap(page, vma, address,
 				     !!(flags & RMAP_EXCLUSIVE));
 	else
 		__page_check_anon_rmap(page, vma, address);
@@ -1321,7 +1315,6 @@ void page_add_file_rmap(struct page *page,
 	bool first;
 
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
-	lock_page_memcg(page);
 
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
@@ -1349,7 +1342,6 @@ void page_add_file_rmap(struct page *page,
 				NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
 	if (nr)
 		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
-	unlock_page_memcg(page);
 
 	mlock_vma_page(page, vma, compound);
 }
@@ -1378,8 +1370,6 @@ void page_remove_rmap(struct page *page,
 		return;
 	}
 
-	lock_page_memcg(page);
-
 	/* Is page being unmapped by PTE? Is this its last map to be removed? */
 	if (likely(!compound)) {
 		last = atomic_add_negative(-1, &page->_mapcount);
@@ -1427,8 +1417,6 @@ void page_remove_rmap(struct page *page,
 	 * and remember that it's only reliable while mapped.
 	 */
 
-	unlock_page_memcg(page);
-
 	munlock_vma_page(page, vma, compound);
 }

rmap changes (mapping and unmapping) of a page currently take
lock_page_memcg() to serialize 1) update of the mapcount and the
cgroup mapped counter with 2) cgroup moving the page and updating the
old cgroup and the new cgroup counters based on page_mapped().

Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups"), we used to reassign all pages that could be found
on a cgroup's LRU list on deletion - something that rmap didn't
naturally serialize against. Since that commit, however, the only
pages that get moved are those mapped into page tables of a task
that's being migrated. In that case, the pte lock is always held (and
we know the page is mapped), which keeps rmap changes at bay already.

The additional lock_page_memcg() by rmap is redundant. Remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 35 ++++++++++++++++++++---------------
 mm/rmap.c       | 12 ------------
 2 files changed, 20 insertions(+), 27 deletions(-)