Message ID | 20200624191417.16735-1-chris@chris-wilson.co.uk (mailing list archive)
---|---
State | New, archived
Series | mm: Skip opportunistic reclaim for dma pinned pages
On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:

> A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.

A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.

So, how do things even get here and/or work today at all? I think the explanation is missing something important.

Jason
On Wed, Jun 24, 2020 at 12:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
>
> > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
>
> A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
>
> So, how do things even get here and/or work today at all? I think the explanation is missing something important.

__remove_mapping() will try to freeze the page count if the count is as expected, and otherwise simply not discard the page. I'm not quite sure why the check is done that late; my wild guess is that the refcount is checked at the last minute so there is still a chance the pin gets released right before it.

But I noticed a bug in __remove_mapping() for THP, since a THP's dma pinned count is recorded in the tail page's hpage_pinned_refcount instead of its refcount. So, the refcount freeze might succeed for a pinned THP. Chris's patch could solve this issue too, but I'm not sure it is worth bailing out earlier once a dma pinned page is met. If it is, the follow-up question is: why not just skip such pages in the scan phase?

> Jason
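For reference, the late refcount check Yang describes lives in __remove_mapping() in mm/vmscan.c. A simplified sketch of that logic, paraphrased from kernels of this era rather than copied verbatim (locking and the writeback handling are elided):

static int remove_mapping_sketch(struct address_space *mapping,
				 struct page *page)
{
	/* One reference per subpage, plus the isolating caller's own. */
	int refcount = 1 + compound_nr(page);

	/*
	 * Atomically drop the count to zero, but only if nobody else
	 * holds a reference. Any extra reference -- including a DMA
	 * pin taken via pin_user_pages() -- makes this fail, and the
	 * page is kept rather than freed.
	 */
	if (!page_ref_freeze(page, refcount))
		return 0;	/* cannot free: the page is still in use */

	/* ...the page is now frozen; detach from the mapping and free... */
	return 1;
}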
Quoting Jason Gunthorpe (2020-06-24 20:21:16)
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
>
> > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
>
> A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
>
> So, how do things even get here and/or work today at all? I think the explanation is missing something important.

[<0>] userptr_mn_invalidate_range_start+0xa7/0x170 [i915]
[<0>] __mmu_notifier_invalidate_range_start+0x110/0x150
[<0>] try_to_unmap_one+0x790/0x870
[<0>] rmap_walk_file+0xe9/0x230
[<0>] rmap_walk+0x27/0x30
[<0>] try_to_unmap+0x89/0xc0
[<0>] shrink_page_list+0x88a/0xf50
[<0>] shrink_inactive_list+0x137/0x2f0
[<0>] shrink_lruvec+0x4ec/0x5f0
[<0>] shrink_node+0x15d/0x410
[<0>] try_to_free_pages+0x17f/0x430
[<0>] __alloc_pages_slowpath+0x2ab/0xcc0
[<0>] __alloc_pages_nodemask+0x1ad/0x1e0
[<0>] new_slab+0x2d8/0x310
[<0>] ___slab_alloc.constprop.0+0x288/0x520
[<0>] __slab_alloc.constprop.0+0xd/0x20
[<0>] kmem_cache_alloc_trace+0x1ad/0x1c0

and that hits an active pin_user_pages object.

Is there some information that would help in particular?
-Chris
On 2020-06-24 12:21, Jason Gunthorpe wrote:
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
>> A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
>
> A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
>
> So, how do things even get here and/or work today at all? I think the explanation is missing something important.

Well, those activities generally try to unmap the page, and have to be prepared to deal with failure to unmap. From my reading, it seemed very clear.

What's less clear is why the comment and the commit description only talk about reclaim, when there are additional things that call try_to_unmap(), including:

    migrate_vma_unmap()
    split_huge_page_to_list() --> unmap_page()

I do like this code change, though. And I *think* it's actually safe to do this, as it stays away from writeback or other filesystem activity. But let me double check that, in case I'm forgetting something.

thanks,
On Wed, Jun 24, 2020 at 1:23 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 12:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> >
> > > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> >
> > A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
> >
> > So, how do things even get here and/or work today at all? I think the explanation is missing something important.
>
> __remove_mapping() will try to freeze the page count if the count is as expected, and otherwise simply not discard the page. I'm not quite sure why the check is done that late; my wild guess is that the refcount is checked at the last minute so there is still a chance the pin gets released right before it.
>
> But I noticed a bug in __remove_mapping() for THP, since a THP's dma pinned count is recorded in the tail page's hpage_pinned_refcount instead of its refcount. So, the refcount freeze might succeed for a pinned THP. Chris's patch could solve this issue too, but I'm not

This bug is not valid. I just realized try_grab_page() would increase both the refcount and hpage_pinned_refcount.

> sure it is worth bailing out earlier once a dma pinned page is met. If it is, the follow-up question is: why not just skip such pages in the scan phase?
>
> > Jason
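The accounting Yang refers to looks roughly like the sketch below, modelled on mm/gup.c of this era (simplified; the WARN_ON sanity checks and multi-reference batching are omitted):

#define GUP_PIN_COUNTING_BIAS (1U << 10)

static bool try_grab_page_sketch(struct page *page, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_page(page);

	if (flags & FOLL_PIN) {
		page = compound_head(page);
		if (hpage_pincount_available(page)) {
			/* THP/hugetlb: a dedicated pin counter... */
			hpage_pincount_add(page, 1);
			/* ...plus one ordinary reference. */
			page_ref_add(page, 1);
		} else {
			/* Small pages fold pins into the refcount. */
			page_ref_add(page, GUP_PIN_COUNTING_BIAS);
		}
		return true;
	}
	return false;
}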
On Wed, Jun 24, 2020 at 01:47:23PM -0700, John Hubbard wrote:
> On 2020-06-24 12:21, Jason Gunthorpe wrote:
> > On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> >
> > A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
> >
> > So, how do things even get here and/or work today at all? I think the explanation is missing something important.
>
> Well, those activities generally try to unmap the page, and have to be prepared to deal with failure to unmap. From my reading, it seemed very clear.

I think Yang explained it - the page is removed from the mappings but freeing it does not happen because page_ref_freeze() does not succeed due to the pin.

Presumably the mappings can reconnect to the same physical page if it is re-faulted, to avoid any data corruption.

So, the issue here is that the mappings are trashed while the page remains - and trashing the mapping triggers a mmu notifier which upsets i915.

> What's less clear is why the comment and the commit description only talk about reclaim, when there are additional things that call try_to_unmap(), including:
>
>     migrate_vma_unmap()
>     split_huge_page_to_list() --> unmap_page()

It looks like the same unmap-first, then abort-if-the-refcount-is-still-elevated design as shrink_page_list()?

> I do like this code change, though. And I *think* it's actually safe to do this, as it stays away from writeback or other filesystem activity. But let me double check that, in case I'm forgetting something.

It would be nice to have an explanation why it is OK now to change it..

I don't know, but could it be that try_to_unmap() has to be done before checking the refcount, as each mapping is included in the refcount? i.e. we couldn't know a DMA pin was active in advance?

Now that we have your pin stuff we can detect a DMA pin without doing all the unmaps?

Jason
On 2020-06-24 16:20, Jason Gunthorpe wrote:
...
> I think Yang explained it - the page is removed from the mappings but freeing it does not happen because page_ref_freeze() does not succeed due to the pin.
>
> Presumably the mappings can reconnect to the same physical page if it is re-faulted, to avoid any data corruption.
>
> So, the issue here is that the mappings are trashed while the page remains - and trashing the mapping triggers a mmu notifier which upsets i915.
>
>> What's less clear is why the comment and the commit description only talk about reclaim, when there are additional things that call try_to_unmap(), including:
>>
>>     migrate_vma_unmap()
>>     split_huge_page_to_list() --> unmap_page()
>
> It looks like the same unmap-first, then abort-if-the-refcount-is-still-elevated design as shrink_page_list()?

Yes. I was just wondering why the documentation here seems to ignore the other, non-reclaim cases. Anyway, though...

>> I do like this code change, though. And I *think* it's actually safe to do this, as it stays away from writeback or other filesystem activity. But let me double check that, in case I'm forgetting something.

...OK, I've checked, and I like it a little bit less now. Mainly for structural reasons, though. I think it would work correctly. But here's a concern: try_to_unmap() should only fail to unmap if there is a reason to not unmap. Having a page be pinned for dma is a reason to not *free* a page, and it's also a reason to be careful about writeback and page buffers for writeback and such. But I'm not sure that it's a reason to fail to remove mappings.

True, most (all?) of the reasons that we remove mappings generally are for things that are not allowed while a page is dma-pinned...at least, today. But still, there's nothing fundamental about a mapping that should prevent it from coming or going while a page is undergoing dma.

So, it's merely a convenient, now-misnamed location in the call stack to fail out. That's not great. It might be better, as Jason hints at below, to fail out a little earlier, instead. That would lead to a few more places to call page_maybe_dma_pinned(), but that's not a real problem, because it's still a small number of places.

After writing all of that...I don't feel strongly about it, because TTU is kind of synonymous with "I'm about to do a dma-pin-unfriendly operation". Maybe some of the more experienced fs or mm people have strong opinions one way or the other?

> It would be nice to have an explanation why it is OK now to change it..

Yes. Definitely good to explain that in the commit log. I think it's triggered by the existence of page_maybe_dma_pinned(). Until that was added, figuring out if dma was involved required basically just guesswork. Now we have a way to guess much more accurately. :)

> I don't know, but could it be that try_to_unmap() has to be done before checking the refcount, as each mapping is included in the refcount? i.e. we couldn't know a DMA pin was active in advance?
>
> Now that we have your pin stuff we can detect a DMA pin without doing all the unmaps?

Once something calls pin_user_page*(), then the pages will be marked as dma-pinned, yes. So no, there is no need to wait until try_to_unmap() to find out.

A final note: depending on where page_maybe_dma_pinned() ends up getting called, this might prevent a fair number of the problems that Jan originally reported [1], and that I also reported separately!

Well, not all of the problems, and only after the filesystems get converted to call pin_user_pages() (working on that next), but...I think it would actually avoid the crash our customer reported back in early 2018, even though we don't have the full file lease + pin_user_pages() solution in place. That's because reclaim is what triggers the problems that we saw. And with this patch, we bail out of reclaim early.

[1] https://www.spinics.net/lists/linux-mm/msg142700.html

thanks,
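For context, page_maybe_dma_pinned() is deliberately a heuristic, as a sketch paraphrased from include/linux/mm.h of this era shows (GUP_PIN_COUNTING_BIAS is the same bias applied at pin time):

static inline bool page_maybe_dma_pinned_sketch(struct page *page)
{
	/* Compound pages carry an exact pin counter. */
	if (hpage_pincount_available(page))
		return compound_pincount(page) > 0;

	/*
	 * Small pages share the refcount, so roughly 1024 plain GUP
	 * references can masquerade as a pin. A false positive merely
	 * means one skipped reclaim attempt, which is tolerable.
	 */
	return page_ref_count(compound_head(page)) >=
	       GUP_PIN_COUNTING_BIAS;
}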
On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
>
> A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.

You are talking about direct reclaim but this path is shared with the background reclaim. This is a bit confusing. Maybe you just want to outline the latency in the reclaim, which is more noticeable to userspace in direct reclaim. This would be good to clarify.

How much memory are we talking about here, btw?

> It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].

Is the latter implemented?

Btw. the overall intention of the patch is not really clear to me. Do I get it right that this is going to reduce the latency of the reclaim for pages that are not reclaimable anyway because they are pinned? If yes, do we have any numbers for that?

It would be also good to explain why the bail out is implemented in try_to_unmap rather than shrink_page_list.

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> ---
> This seems perhaps a little devious and overzealous. Is there a more appropriate TTU flag? Would there be a way to limit its effect to say FOLL_LONGTERM? Doing the migration first would seem to be sensible if we disable opportunistic migration for the duration of the pin.
> ---
>  mm/rmap.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5fe2dedce1fc..374c6e65551b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1393,6 +1393,22 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		is_zone_device_page(page) && !is_device_private_page(page))
>  		return true;
>
> +	/*
> +	 * Try and fail early to revoke a costly DMA pinned page.
> +	 *
> +	 * Reclaiming an active DMA page requires stopping the hardware
> +	 * and flushing access. [Hardware that does support pagefaulting,
> +	 * and so can quickly revoke DMA pages at any time, does not need
> +	 * to pin the DMA page.] At worst, the page may be indefinitely in
> +	 * use by the hardware. Even at best it will take far longer to
> +	 * revoke the access via the mmu notifier, forcing that latency
> +	 * onto our callers rather than the consumer of the HW. As we are
> +	 * called during opportunistic direct reclaim, declare the
> +	 * opportunity cost too high and ignore the page.
> +	 */
> +	if (page_maybe_dma_pinned(page))
> +		return true;

I do not understand why the page table walk needs to be done. The page is going to be pinned no matter how many page tables are mapping it, right?

> +
>  	if (flags & TTU_SPLIT_HUGE_PMD) {
>  		split_huge_pmd_address(vma, address,
>  				       flags & TTU_SPLIT_FREEZE, page);
> --
> 2.20.1
Quoting Michal Hocko (2020-06-25 08:57:25)
> On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> >
> > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
>
> You are talking about direct reclaim but this path is shared with the background reclaim. This is a bit confusing. Maybe you just want to outline the latency in the reclaim, which is more noticeable to userspace in direct reclaim. This would be good to clarify.
>
> How much memory are we talking about here, btw?

It depends. In theory, it is used sparingly. But it is under userspace control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If all goes to plan, the application memory is only pinned for as long as the HW is using it, but that is an indefinite period of time and an indefinite amount of memory. There are provisions in place to impose upper limits on how long an operation can last on the HW, and the mmu-notifier is there to ensure we do unpin the memory on demand. However, cancelling a HW operation (which will result in data loss, and often process termination due to an unfortunate sequence of events when userspace fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not a good choice.

> > It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].
>
> Is the latter implemented?

Depends on the driver; for i915 we have had a shrinker since before we introduced get_user_pages objects. We have the same problem with trying to mitigate userspace wanting to use all of memory for a single operation, whether it's from shmemfs or get_user_pages.

> Btw. the overall intention of the patch is not really clear to me. Do I get it right that this is going to reduce the latency of the reclaim for pages that are not reclaimable anyway because they are pinned? If yes, do we have any numbers for that?

I can plug it into a microbenchmark ala cycletest to show the impact...

Memory filled with 64M gup objects, random utilisation of those with the GPU; background process filling the pagecache with find /; reporting the time difference from the expected expiry of a timer with the actual:

[On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each measuring mean latency for 20s -- mean is probably a really bad choice here, we need 50/90/95/99]

direct reclaim calling mmu-notifier:
gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us

skipping try_to_unmap_one with page_maybe_dma_pinned:
gem_syslatency: cycles=1965, latency mean=597.971us max=28462us

Baseline (background find /; application touched all memory, but no HW ops):
gem_syslatency: cycles=0, latency mean=6.695us max=77us

Compare with the time to allocate a single THP against load:

Baseline:
gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
Direct reclaim calling mmu-notifier:
gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
page_maybe_dma_pinned skip:
gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us

Take with a massive pinch of salt. I expect, once I find the right sequence, to reliably control the induced latency on the RT thread.

But first, I have to look at why there's a correlation with HW load and timer latency, even with steady state usage. That's quite surprising -- ah, I had it left to PREEMPT_VOLUNTARY and this machine has to scan every request submitted to HW. Just great.

With PREEMPT:
Timer:
Base:    gem_syslatency: cycles=0, latency mean=8.823us max=83us
Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us max=4805us
Skip:    gem_syslatency: cycles=2677, latency mean=70.306us max=4720us

THP:
Base:    gem_syslatency: cycles=0, latency mean=1993.693us max=201958us
Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us

Earlier caveats notwithstanding; confidence in results still low. And I need to refine the testing somewhat, at the very least to gather enough samples for credible statistics.

> It would be also good to explain why the bail out is implemented in try_to_unmap rather than shrink_page_list.

I'm in the process of working up the chain, having started with trying to circumvent the wait for reclaim in the mmu notifier callback in the driver.
-Chris
On Wed 24-06-20 17:11:30, John Hubbard wrote:
> On 2020-06-24 16:20, Jason Gunthorpe wrote:
> > > I do like this code change, though. And I *think* it's actually safe to do this, as it stays away from writeback or other filesystem activity. But let me double check that, in case I'm forgetting something.
>
> ...OK, I've checked, and I like it a little bit less now. Mainly for structural reasons, though. I think it would work correctly. But here's a concern: try_to_unmap() should only fail to unmap if there is a reason to not unmap. Having a page be pinned for dma is a reason to not *free* a page, and it's also a reason to be careful about writeback and page buffers for writeback and such. But I'm not sure that it's a reason to fail to remove mappings.
>
> True, most (all?) of the reasons that we remove mappings generally are for things that are not allowed while a page is dma-pinned...at least, today. But still, there's nothing fundamental about a mapping that should prevent it from coming or going while a page is undergoing dma.
>
> So, it's merely a convenient, now-misnamed location in the call stack to fail out. That's not great. It might be better, as Jason hints at below, to fail out a little earlier, instead. That would lead to a few more places to call page_maybe_dma_pinned(), but that's not a real problem, because it's still a small number of places.

Agreed. I think that shrink_page_list(), as Michal writes, is a better place for the page_maybe_dma_pinned() check. But other than that I agree it is good to avoid all the unnecessary work of preparing a page for reclaim when we can now check there's no hope of reclaiming it (at this time).

> > I don't know, but could it be that try_to_unmap() has to be done before checking the refcount, as each mapping is included in the refcount? i.e. we couldn't know a DMA pin was active in advance?
> >
> > Now that we have your pin stuff we can detect a DMA pin without doing all the unmaps?
>
> Once something calls pin_user_page*(), then the pages will be marked as dma-pinned, yes. So no, there is no need to wait until try_to_unmap() to find out.

Yeah, I'm pretty sure the page ref check happens so late because earlier it was impossible to tell whether there are "external" page references reclaim cannot get rid of - a filesystem may (but need not!) hold a page reference for its private data, page tables will hold references as well, etc. Now with page_maybe_dma_pinned() we can detect at least one class of external references (i.e. pins) early, so it's stupid not to do that.

> A final note: depending on where page_maybe_dma_pinned() ends up getting called, this might prevent a fair number of the problems that Jan originally reported [1], and that I also reported separately!
>
> Well, not all of the problems, and only after the filesystems get converted to call pin_user_pages() (working on that next), but...I think it would actually avoid the crash our customer reported back in early 2018, even though we don't have the full file lease + pin_user_pages() solution in place.
>
> That's because reclaim is what triggers the problems that we saw. And with this patch, we bail out of reclaim early.

I agree that with this change, some races will become much less likely for some usecases. But as you say, it's not a full solution.

Honza
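A minimal sketch of the placement Michal and Jan prefer; this is hypothetical, not the posted patch, and assumes it sits early in the shrink_page_list() loop in mm/vmscan.c:

		/*
		 * Hypothetical: bail before any unmap or writeback
		 * effort is spent on the page. An elevated pin count
		 * means __remove_mapping() would fail to freeze the
		 * refcount later anyway.
		 */
		if (page_maybe_dma_pinned(page))
			goto activate_locked;	/* rotate back onto the LRU */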
On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
>
> It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].

Why are DMA pinned pages still on the LRU list at all? I never got an answer to this that made sense to me. By definition, a page which is pinned for DMA is being accessed, and needs to at the very least change position on the LRU list, so just take it off the list when DMA-pinned and put it back on the list when DMA-unpinned.

This overly complex lease stuff must have some reason for existing, but I still don't get it.
On Thu 25-06-20 12:42:09, Matthew Wilcox wrote:
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
> >
> > It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].
>
> Why are DMA pinned pages still on the LRU list at all? I never got an answer to this that made sense to me. By definition, a page which is pinned for DMA is being accessed, and needs to at the very least change position on the LRU list, so just take it off the list when DMA-pinned and put it back on the list when DMA-unpinned.

Well, we do mark_page_accessed() when pinning in GUP. This is not perfect but it's as good as it gets with the CPU having no control over when the page is actually accessed. Also, taking the page off and then back onto the LRU list would increase contention on the LRU list locks and generally cost performance, so for short term pins it is not desirable... Otherwise I agree that conceptually it would make some sense, although I'm not sure some places wouldn't get confused by e.g. page cache pages not being on the LRU list.

Honza
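The mark_page_accessed() call Jan mentions happens during GUP's page-table walk. A simplified sketch of the FOLL_TOUCH handling in follow_page_pte() (mm/gup.c), paraphrased from kernels of this era:

	if (flags & FOLL_TOUCH) {
		if ((flags & FOLL_WRITE) &&
		    !pte_dirty(pte) && !PageDirty(page))
			set_page_dirty_lock(page);
		/*
		 * Promote the page on the LRU now, since the CPU has
		 * no way to observe the device's accesses later.
		 */
		mark_page_accessed(page);
	}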
On Thu 25-06-20 12:00:47, Chris Wilson wrote:
> Quoting Michal Hocko (2020-06-25 08:57:25)
> > On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> > >
> > > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
> >
> > You are talking about direct reclaim but this path is shared with the background reclaim. This is a bit confusing. Maybe you just want to outline the latency in the reclaim, which is more noticeable to userspace in direct reclaim. This would be good to clarify.
> >
> > How much memory are we talking about here, btw?
>
> It depends. In theory, it is used sparingly. But it is under userspace control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If all goes to plan, the application memory is only pinned for as long as the HW is using it, but that is an indefinite period of time and an indefinite amount of memory. There are provisions in place to impose upper limits on how long an operation can last on the HW, and the mmu-notifier is there to ensure we do unpin the memory on demand. However, cancelling a HW operation (which will result in data loss, and often process termination due to an unfortunate sequence of events when userspace fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not a good choice.

OK, thanks for the clarification. What and when should MM intervene to prevent potential OOM?

[...]

> > Btw. the overall intention of the patch is not really clear to me. Do I get it right that this is going to reduce the latency of the reclaim for pages that are not reclaimable anyway because they are pinned? If yes, do we have any numbers for that?
>
> I can plug it into a microbenchmark ala cycletest to show the impact...
>
> Memory filled with 64M gup objects, random utilisation of those with the GPU; background process filling the pagecache with find /; reporting the time difference from the expected expiry of a timer with the actual:
>
> [On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each measuring mean latency for 20s -- mean is probably a really bad choice here, we need 50/90/95/99]
>
> direct reclaim calling mmu-notifier:
> gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us
>
> skipping try_to_unmap_one with page_maybe_dma_pinned:
> gem_syslatency: cycles=1965, latency mean=597.971us max=28462us
>
> Baseline (background find /; application touched all memory, but no HW ops):
> gem_syslatency: cycles=0, latency mean=6.695us max=77us
>
> Compare with the time to allocate a single THP against load:
>
> Baseline:
> gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
> Direct reclaim calling mmu-notifier:
> gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
> page_maybe_dma_pinned skip:
> gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us
>
> Take with a massive pinch of salt. I expect, once I find the right sequence, to reliably control the induced latency on the RT thread.
>
> But first, I have to look at why there's a correlation with HW load and timer latency, even with steady state usage. That's quite surprising -- ah, I had it left to PREEMPT_VOLUNTARY and this machine has to scan every request submitted to HW. Just great.
>
> With PREEMPT:
> Timer:
> Base:    gem_syslatency: cycles=0, latency mean=8.823us max=83us
> Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us max=4805us
> Skip:    gem_syslatency: cycles=2677, latency mean=70.306us max=4720us
>
> THP:
> Base:    gem_syslatency: cycles=0, latency mean=1993.693us max=201958us
> Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
> Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us
>
> Earlier caveats notwithstanding; confidence in results still low. And I need to refine the testing somewhat, at the very least to gather enough samples for credible statistics.

OK, so my understanding is that the overall impact is very low. So what is the primary motivation for the patch? Preventing pointless work - aka invoking the notifier?
Quoting Michal Hocko (2020-06-25 16:12:27)
> On Thu 25-06-20 12:00:47, Chris Wilson wrote:
> > Quoting Michal Hocko (2020-06-25 08:57:25)
> > > On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > > > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> > > >
> > > > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
> > >
> > > You are talking about direct reclaim but this path is shared with the background reclaim. This is a bit confusing. Maybe you just want to outline the latency in the reclaim, which is more noticeable to userspace in direct reclaim. This would be good to clarify.
> > >
> > > How much memory are we talking about here, btw?
> >
> > It depends. In theory, it is used sparingly. But it is under userspace control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If all goes to plan, the application memory is only pinned for as long as the HW is using it, but that is an indefinite period of time and an indefinite amount of memory. There are provisions in place to impose upper limits on how long an operation can last on the HW, and the mmu-notifier is there to ensure we do unpin the memory on demand. However, cancelling a HW operation (which will result in data loss, and often process termination due to an unfortunate sequence of events when userspace fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not a good choice.
>
> OK, thanks for the clarification. What and when should MM intervene to prevent potential OOM?

Open to any and all suggestions. At the moment we have the shrinkers for direct reclaim and kswapd [though for direct reclaim we don't like touching active dma pages for the same reason] and the oom-notifier as a last resort. At the moment, we start trying to flush work from the gpu inside kswapd, but that tbh is not particularly effective. If we used kswapd as a signal for low memory, we could/should be more proactive in releasing work and returning dirty pages to shmemfs as soon as it is complete [normally retiring completed work is lazy and does not immediately drop dirty pages]. That would also suggest a complementary signal for when we can go back to being lazy again.

We are also a bit too lax in applying backpressure. The focus has been on making sure one hog does not prevent an innocent client from accessing the HW. We use cond_synchronize_rcu() on a per-client basis if it appears we are allocating new requests too fast when allocations start failing. That is very probably too little too late. It does achieve the original goal of preventing a single client from submitting requests faster than we can retire them.

But we have nothing for page allocations, other than trying to reclaim shmemfs allocations after a NORETRY failure, hoping to stave off the shrinker. [That fails miserably as we are never alone on the system, ofc.] A cgroups interface to restrict memory allocations on a per client basis has never quite materialised. Even then we have to assume the worst, and overcommitting is the norm.

> [...]
> > > Btw. the overall intention of the patch is not really clear to me. Do I get it right that this is going to reduce the latency of the reclaim for pages that are not reclaimable anyway because they are pinned? If yes, do we have any numbers for that?
> >
> > I can plug it into a microbenchmark ala cycletest to show the impact...
> >
> > Memory filled with 64M gup objects, random utilisation of those with the GPU; background process filling the pagecache with find /; reporting the time difference from the expected expiry of a timer with the actual:
> >
> > [On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each measuring mean latency for 20s -- mean is probably a really bad choice here, we need 50/90/95/99]
> >
> > direct reclaim calling mmu-notifier:
> > gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us
> >
> > skipping try_to_unmap_one with page_maybe_dma_pinned:
> > gem_syslatency: cycles=1965, latency mean=597.971us max=28462us
> >
> > Baseline (background find /; application touched all memory, but no HW ops):
> > gem_syslatency: cycles=0, latency mean=6.695us max=77us
> >
> > Compare with the time to allocate a single THP against load:
> >
> > Baseline:
> > gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
> > Direct reclaim calling mmu-notifier:
> > gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
> > page_maybe_dma_pinned skip:
> > gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us
> >
> > Take with a massive pinch of salt. I expect, once I find the right sequence, to reliably control the induced latency on the RT thread.
> >
> > But first, I have to look at why there's a correlation with HW load and timer latency, even with steady state usage. That's quite surprising -- ah, I had it left to PREEMPT_VOLUNTARY and this machine has to scan every request submitted to HW. Just great.
> >
> > With PREEMPT:
> > Timer:
> > Base:    gem_syslatency: cycles=0, latency mean=8.823us max=83us
> > Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us max=4805us
> > Skip:    gem_syslatency: cycles=2677, latency mean=70.306us max=4720us
> >
> > THP:
> > Base:    gem_syslatency: cycles=0, latency mean=1993.693us max=201958us
> > Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
> > Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us
> >
> > Earlier caveats notwithstanding; confidence in results still low. And I need to refine the testing somewhat, at the very least to gather enough samples for credible statistics.
>
> OK, so my understanding is that the overall impact is very low. So what is the primary motivation for the patch? Preventing pointless work - aka invoking the notifier?

I'm working on capturing the worst cases :) I just need to find the way to consistently steer reclaim to hitting active gup objects, and then wrap that inside a timer, without it appearing too contrived.

Does populating a [THP] mmap seem a reasonable way to measure page allocation latency?
-Chris
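A userspace probe along the lines Chris describes might look like the sketch below; this is hypothetical and is not the gem_syslatency source — it simply times the first touch of a THP-eligible mapping while the system is under memory pressure:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (2UL << 20)	/* one 2MiB huge page */

static long ns_now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main(void)
{
	void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	madvise(p, SZ, MADV_HUGEPAGE);	/* ask khugepaged/fault path for a THP */

	long t0 = ns_now();
	memset(p, 1, SZ);		/* fault it in: allocation happens here */
	long t1 = ns_now();

	printf("first-touch latency: %ld us\n", (t1 - t0) / 1000);
	munmap(p, SZ);
	return 0;
}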
On Thu, Jun 25, 2020 at 03:40:44PM +0200, Jan Kara wrote:
> On Thu 25-06-20 12:42:09, Matthew Wilcox wrote:
> > Why are DMA pinned pages still on the LRU list at all? I never got an answer to this that made sense to me. By definition, a page which is pinned for DMA is being accessed, and needs to at the very least change position on the LRU list, so just take it off the list when DMA-pinned and put it back on the list when DMA-unpinned.
>
> Well, we do mark_page_accessed() when pinning in GUP. This is not perfect but it's as good as it gets with the CPU having no control over when the page is actually accessed. Also, taking the page off and then back onto the LRU list would increase contention on the LRU list locks and generally cost performance, so for short term pins it is not desirable... Otherwise I agree that conceptually it would make some sense, although I'm not sure some places wouldn't get confused by e.g. page cache pages not being on the LRU list.

We could/should do what we do for mlocked pages: Documentation/vm/unevictable-lru.rst

I think 'case five' is wrong and needs to be removed. Pinning is inappropriate for "I'm going to modify the page myself".
On Thu, Jun 25, 2020 at 4:42 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
> >
> > It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].
>
> Why are DMA pinned pages still on the LRU list at all? I never got an answer to this that made sense to me. By definition, a page which is pinned for DMA is being accessed, and needs to at the very least change position on the LRU list, so just take it off the list when DMA-pinned and put it back on the list when DMA-unpinned.

Sounds reasonable to me. In an earlier email I suggested skipping the isolation of dma pinned pages in the scan phase, but if they are long term pinned, it seems preferable to put them on the unevictable lru, IMHO.

> This overly complex lease stuff must have some reason for existing, but I still don't get it.
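Yang's earlier suggestion — refusing to isolate pinned pages during the LRU scan, so reclaim never hands them to shrink_page_list() at all — could look like this hypothetical variant of __isolate_lru_page() in mm/vmscan.c (not a posted patch; the remaining mode checks are elided):

static int isolate_lru_page_sketch(struct page *page, isolate_mode_t mode)
{
	/* Only pages currently on an LRU list may be isolated. */
	if (!PageLRU(page))
		return -EINVAL;

	/*
	 * A DMA-pinned page cannot be freed while pinned; leave it on
	 * the LRU rather than pay for unmap and putback. Moving
	 * long-term pins to the unevictable list, as suggested above,
	 * would need extra plumbing akin to PageMlocked.
	 */
	if (page_maybe_dma_pinned(page))
		return -EBUSY;

	/* ...remaining isolate_mode_t checks elided... */
	return 0;
}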
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fe2dedce1fc..374c6e65551b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1393,6 +1393,22 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
 
+	/*
+	 * Try and fail early to revoke a costly DMA pinned page.
+	 *
+	 * Reclaiming an active DMA page requires stopping the hardware
+	 * and flushing access. [Hardware that does support pagefaulting,
+	 * and so can quickly revoke DMA pages at any time, does not need
+	 * to pin the DMA page.] At worst, the page may be indefinitely in
+	 * use by the hardware. Even at best it will take far longer to
+	 * revoke the access via the mmu notifier, forcing that latency
+	 * onto our callers rather than the consumer of the HW. As we are
+	 * called during opportunistic direct reclaim, declare the
+	 * opportunity cost too high and ignore the page.
+	 */
+	if (page_maybe_dma_pinned(page))
+		return true;
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				       flags & TTU_SPLIT_FREEZE, page);
A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.

A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.

It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
---
This seems perhaps a little devious and overzealous. Is there a more appropriate TTU flag? Would there be a way to limit its effect to say FOLL_LONGTERM? Doing the migration first would seem to be sensible if we disable opportunistic migration for the duration of the pin.
---
 mm/rmap.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)