mm/memcg: fix device private memcg accounting

Message ID	20201009215952.2726-1-rcampbell@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=0lZN=DQ=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9B5E42225B TLS: TLSv1.2, AES256-SHA) id <B5f80dd590003>; Fri, 09 Oct 2020 14:59:53 -0700 From: Ralph Campbell <rcampbell@nvidia.com> To: <linux-mm@kvack.org>, <cgroups@vger.kernel.org>, <linux-kernel@vger.kernel.org> CC: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Vladimir Davydov <vdavydov.dev@gmail.com>, =?utf-8?b?SsOpcsO0bWUgR2xpc3Nl?= <jglisse@redhat.com>, Balbir Singh <bsingharora@gmail.com>, Ira Weiny <ira.weiny@intel.com>, Andrew Morton <akpm@linux-foundation.org>, Ralph Campbell <rcampbell@nvidia.com>, <stable@vger.kernel.org> Subject: [PATCH] mm/memcg: fix device private memcg accounting Date: Fri, 9 Oct 2020 14:59:52 -0700 Message-ID: <20201009215952.2726-1-rcampbell@nvidia.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm/memcg: fix device private memcg accounting \| expand mm/memcg: fix device private memcg accounting

Message ID

20201009215952.2726-1-rcampbell@nvidia.com (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9B5E42225B
From: Ralph Campbell <rcampbell@nvidia.com>
To: <linux-mm@kvack.org>, <cgroups@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
CC: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>,
 Vladimir Davydov <vdavydov.dev@gmail.com>,
 =?utf-8?b?SsOpcsO0bWUgR2xpc3Nl?= <jglisse@redhat.com>,
 Balbir Singh <bsingharora@gmail.com>, Ira Weiny <ira.weiny@intel.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 Ralph Campbell <rcampbell@nvidia.com>, <stable@vger.kernel.org>
Subject: [PATCH] mm/memcg: fix device private memcg accounting
Date: Fri, 9 Oct 2020 14:59:52 -0700
Message-ID: <20201009215952.2726-1-rcampbell@nvidia.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

mm/memcg: fix device private memcg accounting | expand

Commit Message

Ralph Campbell Oct. 9, 2020, 9:59 p.m. UTC

The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
NULL before checking is_device_private_entry() so device private pages
are never handled.
Fix this by checking for non_swap_entry() after handling device private
swap PTEs.

Cc: stable@vger.kernel.org
Fixes: c733a82874a7 ("mm/memcontrol: support MEMORY_DEVICE_PRIVATE")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---

I'm not sure exactly how to test this. I ran the HMM self tests but
that is a minimal sanity check. I think moving the self test from one
memory cgroup to another while it is running would exercise this patch.
I'm looking at how the test could move itself to another group after
migrating some anonymous memory to the test driver.

This applies cleanly to linux-5.9.0-rc8-mm1 and is for Andrew Morton's
tree.

 mm/memcontrol.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Andrew Morton Oct. 9, 2020, 10:50 p.m. UTC | #1

On Fri, 9 Oct 2020 14:59:52 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:

> The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
> NULL before checking is_device_private_entry() so device private pages
> are never handled.
> Fix this by checking for non_swap_entry() after handling device private
> swap PTEs.
> 
> Cc: stable@vger.kernel.org

I was going to ask "what are the end-user visible effects of the bug". 
This is important information with a cc:stable.

> 
> I'm not sure exactly how to test this. I ran the HMM self tests but
> that is a minimal sanity check. I think moving the self test from one
> memory cgroup to another while it is running would exercise this patch.
> I'm looking at how the test could move itself to another group after
> migrating some anonymous memory to the test driver.
> 

But this makes me suspect the answer is "there aren't any that we know
of".  Are you sure a cc:stable is warranted?

Ralph Campbell Oct. 10, 2020, midnight UTC | #2

On 10/9/20 3:50 PM, Andrew Morton wrote:
> On Fri, 9 Oct 2020 14:59:52 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
> 
>> The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
>> NULL before checking is_device_private_entry() so device private pages
>> are never handled.
>> Fix this by checking for non_swap_entry() after handling device private
>> swap PTEs.
>>
>> Cc: stable@vger.kernel.org
> 
> I was going to ask "what are the end-user visible effects of the bug".
> This is important information with a cc:stable.
> 
>>
>> I'm not sure exactly how to test this. I ran the HMM self tests but
>> that is a minimal sanity check. I think moving the self test from one
>> memory cgroup to another while it is running would exercise this patch.
>> I'm looking at how the test could move itself to another group after
>> migrating some anonymous memory to the test driver.
>>
> 
> But this makes me suspect the answer is "there aren't any that we know
> of".  Are you sure a cc:stable is warranted?
> 

I assume the memory cgroup accounting would be off somehow when moving
a process to another memory cgroup.
Currently, the device private page is charged like a normal anonymous page
when allocated and is uncharged when the page is freed so I think that path is OK.
Maybe someone who knows more about memory cgroup accounting can comment?

Johannes Weiner Oct. 12, 2020, 1:28 p.m. UTC | #3

On Fri, Oct 09, 2020 at 05:00:37PM -0700, Ralph Campbell wrote:
> 
> On 10/9/20 3:50 PM, Andrew Morton wrote:
> > On Fri, 9 Oct 2020 14:59:52 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
> > 
> > > The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
> > > NULL before checking is_device_private_entry() so device private pages
> > > are never handled.
> > > Fix this by checking for non_swap_entry() after handling device private
> > > swap PTEs.

The fix looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> > But this makes me suspect the answer is "there aren't any that we know
> > of".  Are you sure a cc:stable is warranted?
> > 
> 
> I assume the memory cgroup accounting would be off somehow when moving
> a process to another memory cgroup.
> Currently, the device private page is charged like a normal anonymous page
> when allocated and is uncharged when the page is freed so I think that path is OK.
> Maybe someone who knows more about memory cgroup accounting can comment?

As for whether to CC stable, I'm leaning toward no:

- When moving tasks, we'd leave their device pages behind in the old
  cgroup. This isn't great, but it doesn't cause counter imbalances or
  corruption or anything - we also skip locked pages, we used to skip
  pages mapped by more than one pte, the user can select whether to
  move pages along tasks at all, and if so, whether only anon or file.

- Charge moving itself is a bit of a questionable feature, and users
  have been moving away from it. Leaving tasks in a cgroup and
  changing the configuration is a heck of a lot cheaper than moving
  potentially gigabytes of pages to another configuration domain.

- According to the Fixes tag, this isn't a regression, either. Since
  their inception, we have never migrated device pages.

Ralph Campbell Oct. 12, 2020, 5:11 p.m. UTC | #4

On 10/12/20 6:28 AM, Johannes Weiner wrote:
> On Fri, Oct 09, 2020 at 05:00:37PM -0700, Ralph Campbell wrote:
>>
>> On 10/9/20 3:50 PM, Andrew Morton wrote:
>>> On Fri, 9 Oct 2020 14:59:52 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
>>>
>>>> The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
>>>> NULL before checking is_device_private_entry() so device private pages
>>>> are never handled.
>>>> Fix this by checking for non_swap_entry() after handling device private
>>>> swap PTEs.
> 
> The fix looks good to me.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
>>> But this makes me suspect the answer is "there aren't any that we know
>>> of".  Are you sure a cc:stable is warranted?
>>>
>>
>> I assume the memory cgroup accounting would be off somehow when moving
>> a process to another memory cgroup.
>> Currently, the device private page is charged like a normal anonymous page
>> when allocated and is uncharged when the page is freed so I think that path is OK.
>> Maybe someone who knows more about memory cgroup accounting can comment?
> 
> As for whether to CC stable, I'm leaning toward no:
> 
> - When moving tasks, we'd leave their device pages behind in the old
>    cgroup. This isn't great, but it doesn't cause counter imbalances or
>    corruption or anything - we also skip locked pages, we used to skip
>    pages mapped by more than one pte, the user can select whether to
>    move pages along tasks at all, and if so, whether only anon or file.
> 
> - Charge moving itself is a bit of a questionable feature, and users
>    have been moving away from it. Leaving tasks in a cgroup and
>    changing the configuration is a heck of a lot cheaper than moving
>    potentially gigabytes of pages to another configuration domain.
> 
> - According to the Fixes tag, this isn't a regression, either. Since
>    their inception, we have never migrated device pages.

Thanks for the Acked-by and the comments.
I assume Andrew will update the tags when queuing this up unless he wants me to resend.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..3a24e3b619f5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5549,7 +5549,7 @@  static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	swp_entry_t ent = pte_to_swp_entry(ptent);
 
-	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
+	if (!(mc.flags & MOVE_ANON))
 		return NULL;
 
 	/*
@@ -5568,6 +5568,9 @@  static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 		return page;
 	}
 
+	if (non_swap_entry(ent))
+		return NULL;
+
 	/*
 	 * Because lookup_swap_cache() updates some statistics counter,
 	 * we call find_get_page() with swapper_space directly.

mm/memcg: fix device private memcg accounting

Commit Message

Comments

Patch