Message ID | 20200624191417.16735-1-chris@chris-wilson.co.uk (mailing list archive)
---|---
State | New, archived
Series | mm: Skip opportunistic reclaim for dma pinned pages
On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:

> A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.

A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.

So, how do things even get here and/or work today at all? I think the explanation is missing something important.

Jason
On Wed, Jun 24, 2020 at 12:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
>
> > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
>
> A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
>
> So, how do things even get here and/or work today at all? I think the explanation is missing something important.

__remove_mapping() will try to freeze the page count if the count is as expected, and otherwise simply not discard the page. I'm not quite sure why the check is done that late; my wild guess is that the refcount is checked at the last minute so there is still a chance the pin gets released right before it.

But I noticed a bug in __remove_mapping() for THP, since a THP's dma pinned count is recorded in the tail page's hpage_pinned_refcount instead of its refcount. So, the refcount freeze might succeed for a pinned THP. Chris's patch could solve this issue too, but I'm not sure it is worth bailing out earlier once a dma pinned page is met. If it is, the follow-up question is: why not just skip such pages in the scan phase?

> Jason
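For reference, the late refcount check Yang describes lives in __remove_mapping() in mm/vmscan.c. A simplified sketch of that logic, paraphrased from kernels of this era rather than copied verbatim (locking and the writeback handling are elided):

static int remove_mapping_sketch(struct address_space *mapping,
				 struct page *page)
{
	/* One reference per subpage, plus the isolating caller's own. */
	int refcount = 1 + compound_nr(page);

	/*
	 * Atomically drop the count to zero, but only if nobody else
	 * holds a reference. Any extra reference -- including a DMA
	 * pin taken via pin_user_pages() -- makes this fail, and the
	 * page is kept rather than freed.
	 */
	if (!page_ref_freeze(page, refcount))
		return 0;	/* cannot free: the page is still in use */

	/* ...the page is now frozen; detach from the mapping and free... */
	return 1;
}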
Quoting Jason Gunthorpe (2020-06-24 20:21:16)
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
>
> > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
>
> A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
>
> So, how do things even get here and/or work today at all? I think the explanation is missing something important.

[<0>] userptr_mn_invalidate_range_start+0xa7/0x170 [i915]
[<0>] __mmu_notifier_invalidate_range_start+0x110/0x150
[<0>] try_to_unmap_one+0x790/0x870
[<0>] rmap_walk_file+0xe9/0x230
[<0>] rmap_walk+0x27/0x30
[<0>] try_to_unmap+0x89/0xc0
[<0>] shrink_page_list+0x88a/0xf50
[<0>] shrink_inactive_list+0x137/0x2f0
[<0>] shrink_lruvec+0x4ec/0x5f0
[<0>] shrink_node+0x15d/0x410
[<0>] try_to_free_pages+0x17f/0x430
[<0>] __alloc_pages_slowpath+0x2ab/0xcc0
[<0>] __alloc_pages_nodemask+0x1ad/0x1e0
[<0>] new_slab+0x2d8/0x310
[<0>] ___slab_alloc.constprop.0+0x288/0x520
[<0>] __slab_alloc.constprop.0+0xd/0x20
[<0>] kmem_cache_alloc_trace+0x1ad/0x1c0

and that hits an active pin_user_pages object.

Is there some information that would help in particular?
-Chris
On 2020-06-24 12:21, Jason Gunthorpe wrote:
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
>> A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
>
> A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
>
> So, how do things even get here and/or work today at all? I think the explanation is missing something important.

Well, those activities generally try to unmap the page, and have to be prepared to deal with failure to unmap. From my reading, it seemed very clear.

What's less clear is why the comment and the commit description only talk about reclaim, when there are additional things that call try_to_unmap(), including:

    migrate_vma_unmap()
    split_huge_page_to_list() --> unmap_page()

I do like this code change, though. And I *think* it's actually safe to do this, as it stays away from writeback or other filesystem activity. But let me double check that, in case I'm forgetting something.

thanks,
On Wed, Jun 24, 2020 at 1:23 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 12:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> >
> > > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> >
> > A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
> >
> > So, how do things even get here and/or work today at all? I think the explanation is missing something important.
>
> __remove_mapping() will try to freeze the page count if the count is as expected, and otherwise simply not discard the page. I'm not quite sure why the check is done that late; my wild guess is that the refcount is checked at the last minute so there is still a chance the pin gets released right before it.
>
> But I noticed a bug in __remove_mapping() for THP, since a THP's dma pinned count is recorded in the tail page's hpage_pinned_refcount instead of its refcount. So, the refcount freeze might succeed for a pinned THP. Chris's patch could solve this issue too, but I'm not

This bug is not valid. I just realized try_grab_page() would increase both the refcount and hpage_pinned_refcount.

> sure it is worth bailing out earlier once a dma pinned page is met. If it is, the follow-up question is: why not just skip such pages in the scan phase?
>
> > Jason
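The accounting Yang refers to looks roughly like the sketch below, modelled on mm/gup.c of this era (simplified; the WARN_ON sanity checks and multi-reference batching are omitted):

#define GUP_PIN_COUNTING_BIAS (1U << 10)

static bool try_grab_page_sketch(struct page *page, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_page(page);

	if (flags & FOLL_PIN) {
		page = compound_head(page);
		if (hpage_pincount_available(page)) {
			/* THP/hugetlb: a dedicated pin counter... */
			hpage_pincount_add(page, 1);
			/* ...plus one ordinary reference. */
			page_ref_add(page, 1);
		} else {
			/* Small pages fold pins into the refcount. */
			page_ref_add(page, GUP_PIN_COUNTING_BIAS);
		}
		return true;
	}
	return false;
}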
On Wed, Jun 24, 2020 at 01:47:23PM -0700, John Hubbard wrote:
> On 2020-06-24 12:21, Jason Gunthorpe wrote:
> > On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> >
> > A pinned page can't be migrated, discarded or swapped by definition - it would cause data corruption.
> >
> > So, how do things even get here and/or work today at all? I think the explanation is missing something important.
>
> Well, those activities generally try to unmap the page, and have to be prepared to deal with failure to unmap. From my reading, it seemed very clear.

I think Yang explained it - the page is removed from the mappings but freeing it does not happen because page_ref_freeze() does not succeed due to the pin.

Presumably the mappings can reconnect to the same physical page if it is re-faulted, to avoid any data corruption.

So, the issue here is that the mappings are trashed while the page remains - and trashing the mapping triggers a mmu notifier which upsets i915.

> What's less clear is why the comment and the commit description only talk about reclaim, when there are additional things that call try_to_unmap(), including:
>
>     migrate_vma_unmap()
>     split_huge_page_to_list() --> unmap_page()

It looks like the same unmap-first, then abort-if-the-refcount-is-still-elevated design as shrink_page_list()?

> I do like this code change, though. And I *think* it's actually safe to do this, as it stays away from writeback or other filesystem activity. But let me double check that, in case I'm forgetting something.

It would be nice to have an explanation why it is OK now to change it..

I don't know, but could it be that try_to_unmap() has to be done before checking the refcount, as each mapping is included in the refcount? i.e. we couldn't know a DMA pin was active in advance?

Now that we have your pin stuff we can detect a DMA pin without doing all the unmaps?

Jason
On 2020-06-24 16:20, Jason Gunthorpe wrote:
...
> I think Yang explained it - the page is removed from the mappings but freeing it does not happen because page_ref_freeze() does not succeed due to the pin.
>
> Presumably the mappings can reconnect to the same physical page if it is re-faulted, to avoid any data corruption.
>
> So, the issue here is that the mappings are trashed while the page remains - and trashing the mapping triggers a mmu notifier which upsets i915.
>
>> What's less clear is why the comment and the commit description only talk about reclaim, when there are additional things that call try_to_unmap(), including:
>>
>>     migrate_vma_unmap()
>>     split_huge_page_to_list() --> unmap_page()
>
> It looks like the same unmap-first, then abort-if-the-refcount-is-still-elevated design as shrink_page_list()?

Yes. I was just wondering why the documentation here seems to ignore the other, non-reclaim cases. Anyway, though...

>> I do like this code change, though. And I *think* it's actually safe to do this, as it stays away from writeback or other filesystem activity. But let me double check that, in case I'm forgetting something.

...OK, I've checked, and I like it a little bit less now. Mainly for structural reasons, though. I think it would work correctly. But here's a concern: try_to_unmap() should only fail to unmap if there is a reason to not unmap. Having a page be pinned for dma is a reason to not *free* a page, and it's also a reason to be careful about writeback and page buffers for writeback and such. But I'm not sure that it's a reason to fail to remove mappings.

True, most (all?) of the reasons that we remove mappings generally are for things that are not allowed while a page is dma-pinned...at least, today. But still, there's nothing fundamental about a mapping that should prevent it from coming or going while a page is undergoing dma.

So, it's merely a convenient, now-misnamed location in the call stack to fail out. That's not great. It might be better, as Jason hints at below, to fail out a little earlier, instead. That would lead to a few more places to call page_maybe_dma_pinned(), but that's not a real problem, because it's still a small number of places.

After writing all of that...I don't feel strongly about it, because TTU is kind of synonymous with "I'm about to do a dma-pin-unfriendly operation". Maybe some of the more experienced fs or mm people have strong opinions one way or the other?

> It would be nice to have an explanation why it is OK now to change it..

Yes. Definitely good to explain that in the commit log. I think it's triggered by the existence of page_maybe_dma_pinned(). Until that was added, figuring out if dma was involved required basically just guesswork. Now we have a way to guess much more accurately. :)

> I don't know, but could it be that try_to_unmap() has to be done before checking the refcount, as each mapping is included in the refcount? i.e. we couldn't know a DMA pin was active in advance?
>
> Now that we have your pin stuff we can detect a DMA pin without doing all the unmaps?

Once something calls pin_user_page*(), then the pages will be marked as dma-pinned, yes. So no, there is no need to wait until try_to_unmap() to find out.

A final note: depending on where page_maybe_dma_pinned() ends up getting called, this might prevent a fair number of the problems that Jan originally reported [1], and that I also reported separately!

Well, not all of the problems, and only after the filesystems get converted to call pin_user_pages() (working on that next), but...I think it would actually avoid the crash our customer reported back in early 2018, even though we don't have the full file lease + pin_user_pages() solution in place. That's because reclaim is what triggers the problems that we saw. And with this patch, we bail out of reclaim early.

[1] https://www.spinics.net/lists/linux-mm/msg142700.html

thanks,
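For context, page_maybe_dma_pinned() is deliberately a heuristic, as a sketch paraphrased from include/linux/mm.h of this era shows (GUP_PIN_COUNTING_BIAS is the same bias applied at pin time):

static inline bool page_maybe_dma_pinned_sketch(struct page *page)
{
	/* Compound pages carry an exact pin counter. */
	if (hpage_pincount_available(page))
		return compound_pincount(page) > 0;

	/*
	 * Small pages share the refcount, so roughly 1024 plain GUP
	 * references can masquerade as a pin. A false positive merely
	 * means one skipped reclaim attempt, which is tolerable.
	 */
	return page_ref_count(compound_head(page)) >=
	       GUP_PIN_COUNTING_BIAS;
}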
On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
>
> A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.

You are talking about direct reclaim but this path is shared with the background reclaim. This is a bit confusing. Maybe you just want to outline the latency in the reclaim, which is more noticeable to userspace in direct reclaim. This would be good to clarify.

How much memory are we talking about here, btw?

> It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].

Is the latter implemented?

Btw. the overall intention of the patch is not really clear to me. Do I get it right that this is going to reduce the latency of the reclaim for pages that are not reclaimable anyway because they are pinned? If yes, do we have any numbers for that?

It would be also good to explain why the bail out is implemented in try_to_unmap rather than shrink_page_list.

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> ---
> This seems perhaps a little devious and overzealous. Is there a more appropriate TTU flag? Would there be a way to limit its effect to say FOLL_LONGTERM? Doing the migration first would seem to be sensible if we disable opportunistic migration for the duration of the pin.
> ---
>  mm/rmap.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5fe2dedce1fc..374c6e65551b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1393,6 +1393,22 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		is_zone_device_page(page) && !is_device_private_page(page))
>  		return true;
>
> +	/*
> +	 * Try and fail early to revoke a costly DMA pinned page.
> +	 *
> +	 * Reclaiming an active DMA page requires stopping the hardware
> +	 * and flushing access. [Hardware that does support pagefaulting,
> +	 * and so can quickly revoke DMA pages at any time, does not need
> +	 * to pin the DMA page.] At worst, the page may be indefinitely in
> +	 * use by the hardware. Even at best it will take far longer to
> +	 * revoke the access via the mmu notifier, forcing that latency
> +	 * onto our callers rather than the consumer of the HW. As we are
> +	 * called during opportunistic direct reclaim, declare the
> +	 * opportunity cost too high and ignore the page.
> +	 */
> +	if (page_maybe_dma_pinned(page))
> +		return true;

I do not understand why the page table walk needs to be done. The page is going to be pinned no matter how many page tables are mapping it, right?

> +
>  	if (flags & TTU_SPLIT_HUGE_PMD) {
>  		split_huge_pmd_address(vma, address,
>  				       flags & TTU_SPLIT_FREEZE, page);
> --
> 2.20.1
Quoting Michal Hocko (2020-06-25 08:57:25)
> On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> >
> > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
>
> You are talking about direct reclaim but this path is shared with the background reclaim. This is a bit confusing. Maybe you just want to outline the latency in the reclaim, which is more noticeable to userspace in direct reclaim. This would be good to clarify.
>
> How much memory are we talking about here, btw?

It depends. In theory, it is used sparingly. But it is under userspace control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If all goes to plan, the application memory is only pinned for as long as the HW is using it, but that is an indefinite period of time and an indefinite amount of memory. There are provisions in place to impose upper limits on how long an operation can last on the HW, and the mmu-notifier is there to ensure we do unpin the memory on demand. However, cancelling a HW operation (which will result in data loss, and often process termination due to an unfortunate sequence of events when userspace fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not a good choice.

> > It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].
>
> Is the latter implemented?

Depends on the driver; for i915 we have had a shrinker since before we introduced get_user_pages objects. We have the same problem with trying to mitigate userspace wanting to use all of memory for a single operation, whether it's from shmemfs or get_user_pages.

> Btw. the overall intention of the patch is not really clear to me. Do I get it right that this is going to reduce the latency of the reclaim for pages that are not reclaimable anyway because they are pinned? If yes, do we have any numbers for that?

I can plug it into a microbenchmark ala cycletest to show the impact...

Memory filled with 64M gup objects, random utilisation of those with the GPU; background process filling the pagecache with find /; reporting the time difference from the expected expiry of a timer with the actual:

[On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each measuring mean latency for 20s -- mean is probably a really bad choice here, we need 50/90/95/99]

direct reclaim calling mmu-notifier:
gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us

skipping try_to_unmap_one with page_maybe_dma_pinned:
gem_syslatency: cycles=1965, latency mean=597.971us max=28462us

Baseline (background find /; application touched all memory, but no HW ops):
gem_syslatency: cycles=0, latency mean=6.695us max=77us

Compare with the time to allocate a single THP against load:

Baseline:
gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
Direct reclaim calling mmu-notifier:
gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
page_maybe_dma_pinned skip:
gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us

Take with a massive pinch of salt. I expect, once I find the right sequence, to reliably control the induced latency on the RT thread.

But first, I have to look at why there's a correlation with HW load and timer latency, even with steady state usage. That's quite surprising -- ah, I had it left to PREEMPT_VOLUNTARY and this machine has to scan every request submitted to HW. Just great.

With PREEMPT:
Timer:
Base:    gem_syslatency: cycles=0, latency mean=8.823us max=83us
Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us max=4805us
Skip:    gem_syslatency: cycles=2677, latency mean=70.306us max=4720us

THP:
Base:    gem_syslatency: cycles=0, latency mean=1993.693us max=201958us
Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us

Earlier caveats notwithstanding; confidence in results still low. And I need to refine the testing somewhat, at the very least to gather enough samples for credible statistics.

> It would be also good to explain why the bail out is implemented in try_to_unmap rather than shrink_page_list.

I'm in the process of working up the chain, having started with trying to circumvent the wait for reclaim in the mmu notifier callback in the driver.
-Chris
On Wed 24-06-20 17:11:30, John Hubbard wrote:
> On 2020-06-24 16:20, Jason Gunthorpe wrote:
> > > I do like this code change, though. And I *think* it's actually safe to do this, as it stays away from writeback or other filesystem activity. But let me double check that, in case I'm forgetting something.
>
> ...OK, I've checked, and I like it a little bit less now. Mainly for structural reasons, though. I think it would work correctly. But here's a concern: try_to_unmap() should only fail to unmap if there is a reason to not unmap. Having a page be pinned for dma is a reason to not *free* a page, and it's also a reason to be careful about writeback and page buffers for writeback and such. But I'm not sure that it's a reason to fail to remove mappings.
>
> True, most (all?) of the reasons that we remove mappings generally are for things that are not allowed while a page is dma-pinned...at least, today. But still, there's nothing fundamental about a mapping that should prevent it from coming or going while a page is undergoing dma.
>
> So, it's merely a convenient, now-misnamed location in the call stack to fail out. That's not great. It might be better, as Jason hints at below, to fail out a little earlier, instead. That would lead to a few more places to call page_maybe_dma_pinned(), but that's not a real problem, because it's still a small number of places.

Agreed. I think that shrink_page_list(), as Michal writes, is a better place for the page_maybe_dma_pinned() check. But other than that I agree it is good to avoid all the unnecessary work of preparing a page for reclaim when we can now check there's no hope of reclaiming it (at this time).

> > I don't know, but could it be that try_to_unmap() has to be done before checking the refcount, as each mapping is included in the refcount? i.e. we couldn't know a DMA pin was active in advance?
> >
> > Now that we have your pin stuff we can detect a DMA pin without doing all the unmaps?
>
> Once something calls pin_user_page*(), then the pages will be marked as dma-pinned, yes. So no, there is no need to wait until try_to_unmap() to find out.

Yeah, I'm pretty sure the page ref check happens so late because earlier it was impossible to tell whether there are "external" page references reclaim cannot get rid of - a filesystem may (but need not!) hold a page reference for its private data, page tables will hold references as well, etc. Now with page_maybe_dma_pinned() we can detect at least one class of external references (i.e. pins) early, so it's stupid not to do that.

> A final note: depending on where page_maybe_dma_pinned() ends up getting called, this might prevent a fair number of the problems that Jan originally reported [1], and that I also reported separately!
>
> Well, not all of the problems, and only after the filesystems get converted to call pin_user_pages() (working on that next), but...I think it would actually avoid the crash our customer reported back in early 2018, even though we don't have the full file lease + pin_user_pages() solution in place.
>
> That's because reclaim is what triggers the problems that we saw. And with this patch, we bail out of reclaim early.

I agree that with this change, some races will become much less likely for some usecases. But as you say, it's not a full solution.

Honza
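A minimal sketch of the placement Michal and Jan prefer; this is hypothetical, not the posted patch, and assumes it sits early in the shrink_page_list() loop in mm/vmscan.c:

		/*
		 * Hypothetical: bail before any unmap or writeback
		 * effort is spent on the page. An elevated pin count
		 * means __remove_mapping() would fail to freeze the
		 * refcount later anyway.
		 */
		if (page_maybe_dma_pinned(page))
			goto activate_locked;	/* rotate back onto the LRU */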
On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
>
> It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].

Why are DMA pinned pages still on the LRU list at all? I never got an answer to this that made sense to me. By definition, a page which is pinned for DMA is being accessed, and needs to at the very least change position on the LRU list, so just take it off the list when DMA-pinned and put it back on the list when DMA-unpinned.

This overly complex lease stuff must have some reason for existing, but I still don't get it.
On Thu 25-06-20 12:42:09, Matthew Wilcox wrote:
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
> >
> > It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].
>
> Why are DMA pinned pages still on the LRU list at all? I never got an answer to this that made sense to me. By definition, a page which is pinned for DMA is being accessed, and needs to at the very least change position on the LRU list, so just take it off the list when DMA-pinned and put it back on the list when DMA-unpinned.

Well, we do mark_page_accessed() when pinning in GUP. This is not perfect but it's as good as it gets with the CPU having no control over when the page is actually accessed. Also, taking the page off and then back onto the LRU list would increase contention on the LRU list locks and generally cost performance, so for short term pins it is not desirable... Otherwise I agree that conceptually it would make some sense, although I'm not sure some places wouldn't get confused by e.g. page cache pages not being on the LRU list.

Honza
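The mark_page_accessed() call Jan mentions happens during GUP's page-table walk. A simplified sketch of the FOLL_TOUCH handling in follow_page_pte() (mm/gup.c), paraphrased from kernels of this era:

	if (flags & FOLL_TOUCH) {
		if ((flags & FOLL_WRITE) &&
		    !pte_dirty(pte) && !PageDirty(page))
			set_page_dirty_lock(page);
		/*
		 * Promote the page on the LRU now, since the CPU has
		 * no way to observe the device's accesses later.
		 */
		mark_page_accessed(page);
	}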
On Thu 25-06-20 12:00:47, Chris Wilson wrote:
> Quoting Michal Hocko (2020-06-25 08:57:25)
> > On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> > >
> > > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
> >
> > You are talking about direct reclaim but this path is shared with the background reclaim. This is a bit confusing. Maybe you just want to outline the latency in the reclaim, which is more noticeable to userspace in direct reclaim. This would be good to clarify.
> >
> > How much memory are we talking about here, btw?
>
> It depends. In theory, it is used sparingly. But it is under userspace control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If all goes to plan, the application memory is only pinned for as long as the HW is using it, but that is an indefinite period of time and an indefinite amount of memory. There are provisions in place to impose upper limits on how long an operation can last on the HW, and the mmu-notifier is there to ensure we do unpin the memory on demand. However, cancelling a HW operation (which will result in data loss, and often process termination due to an unfortunate sequence of events when userspace fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not a good choice.

OK, thanks for the clarification. What and when should MM intervene to prevent potential OOM?

[...]

> > Btw. the overall intention of the patch is not really clear to me. Do I get it right that this is going to reduce the latency of the reclaim for pages that are not reclaimable anyway because they are pinned? If yes, do we have any numbers for that?
>
> I can plug it into a microbenchmark ala cycletest to show the impact...
>
> Memory filled with 64M gup objects, random utilisation of those with the GPU; background process filling the pagecache with find /; reporting the time difference from the expected expiry of a timer with the actual:
>
> [On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each measuring mean latency for 20s -- mean is probably a really bad choice here, we need 50/90/95/99]
>
> direct reclaim calling mmu-notifier:
> gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us
>
> skipping try_to_unmap_one with page_maybe_dma_pinned:
> gem_syslatency: cycles=1965, latency mean=597.971us max=28462us
>
> Baseline (background find /; application touched all memory, but no HW ops):
> gem_syslatency: cycles=0, latency mean=6.695us max=77us
>
> Compare with the time to allocate a single THP against load:
>
> Baseline:
> gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
> Direct reclaim calling mmu-notifier:
> gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
> page_maybe_dma_pinned skip:
> gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us
>
> Take with a massive pinch of salt. I expect, once I find the right sequence, to reliably control the induced latency on the RT thread.
>
> But first, I have to look at why there's a correlation with HW load and timer latency, even with steady state usage. That's quite surprising -- ah, I had it left to PREEMPT_VOLUNTARY and this machine has to scan every request submitted to HW. Just great.
>
> With PREEMPT:
> Timer:
> Base:    gem_syslatency: cycles=0, latency mean=8.823us max=83us
> Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us max=4805us
> Skip:    gem_syslatency: cycles=2677, latency mean=70.306us max=4720us
>
> THP:
> Base:    gem_syslatency: cycles=0, latency mean=1993.693us max=201958us
> Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
> Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us
>
> Earlier caveats notwithstanding; confidence in results still low. And I need to refine the testing somewhat, at the very least to gather enough samples for credible statistics.

OK, so my understanding is that the overall impact is very low. So what is the primary motivation for the patch? Preventing pointless work - aka invoking the notifier?
Quoting Michal Hocko (2020-06-25 16:12:27)
> On Thu 25-06-20 12:00:47, Chris Wilson wrote:
> > Quoting Michal Hocko (2020-06-25 08:57:25)
> > > On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > > > A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.
> > > >
> > > > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
> > >
> > > You are talking about direct reclaim but this path is shared with the background reclaim. This is a bit confusing. Maybe you just want to outline the latency in the reclaim, which is more noticeable to userspace in direct reclaim. This would be good to clarify.
> > >
> > > How much memory are we talking about here, btw?
> >
> > It depends. In theory, it is used sparingly. But it is under userspace control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If all goes to plan, the application memory is only pinned for as long as the HW is using it, but that is an indefinite period of time and an indefinite amount of memory. There are provisions in place to impose upper limits on how long an operation can last on the HW, and the mmu-notifier is there to ensure we do unpin the memory on demand. However, cancelling a HW operation (which will result in data loss, and often process termination due to an unfortunate sequence of events when userspace fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not a good choice.
>
> OK, thanks for the clarification. What and when should MM intervene to prevent potential OOM?

Open to any and all suggestions. At the moment we have the shrinkers for direct reclaim and kswapd [though for direct reclaim we don't like touching active dma pages for the same reason] and the oom-notifier as a last resort. At the moment, we start trying to flush work from the gpu inside kswapd, but that tbh is not particularly effective. If we used kswapd as a signal for low memory, we could/should be more proactive in releasing work and returning dirty pages to shmemfs as soon as it is complete [normally retiring completed work is lazy and does not immediately drop dirty pages]. That would also suggest a complementary signal for when we can go back to being lazy again.

We are also a bit too lax in applying backpressure. The focus has been on making sure one hog does not prevent an innocent client from accessing the HW. We use cond_synchronize_rcu() on a per-client basis if it appears we are allocating new requests too fast when allocations start failing. That is very probably too little too late. It does achieve the original goal of preventing a single client from submitting requests faster than we can retire them.

But we have nothing for page allocations, other than trying to reclaim shmemfs allocations after a NORETRY failure, hoping to stave off the shrinker. [That fails miserably as we are never alone on the system, ofc.] A cgroups interface to restrict memory allocations on a per client basis has never quite materialised. Even then we have to assume the worst, and overcommitting is the norm.

> [...]
> > > Btw. the overall intention of the patch is not really clear to me. Do I get it right that this is going to reduce the latency of the reclaim for pages that are not reclaimable anyway because they are pinned? If yes, do we have any numbers for that?
> >
> > I can plug it into a microbenchmark ala cycletest to show the impact...
> >
> > Memory filled with 64M gup objects, random utilisation of those with the GPU; background process filling the pagecache with find /; reporting the time difference from the expected expiry of a timer with the actual:
> >
> > [On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each measuring mean latency for 20s -- mean is probably a really bad choice here, we need 50/90/95/99]
> >
> > direct reclaim calling mmu-notifier:
> > gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us
> >
> > skipping try_to_unmap_one with page_maybe_dma_pinned:
> > gem_syslatency: cycles=1965, latency mean=597.971us max=28462us
> >
> > Baseline (background find /; application touched all memory, but no HW ops):
> > gem_syslatency: cycles=0, latency mean=6.695us max=77us
> >
> > Compare with the time to allocate a single THP against load:
> >
> > Baseline:
> > gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
> > Direct reclaim calling mmu-notifier:
> > gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
> > page_maybe_dma_pinned skip:
> > gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us
> >
> > Take with a massive pinch of salt. I expect, once I find the right sequence, to reliably control the induced latency on the RT thread.
> >
> > But first, I have to look at why there's a correlation with HW load and timer latency, even with steady state usage. That's quite surprising -- ah, I had it left to PREEMPT_VOLUNTARY and this machine has to scan every request submitted to HW. Just great.
> >
> > With PREEMPT:
> > Timer:
> > Base:    gem_syslatency: cycles=0, latency mean=8.823us max=83us
> > Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us max=4805us
> > Skip:    gem_syslatency: cycles=2677, latency mean=70.306us max=4720us
> >
> > THP:
> > Base:    gem_syslatency: cycles=0, latency mean=1993.693us max=201958us
> > Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
> > Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us
> >
> > Earlier caveats notwithstanding; confidence in results still low. And I need to refine the testing somewhat, at the very least to gather enough samples for credible statistics.
>
> OK, so my understanding is that the overall impact is very low. So what is the primary motivation for the patch? Preventing pointless work - aka invoking the notifier?

I'm working on capturing the worst cases :) I just need to find the way to consistently steer reclaim to hitting active gup objects, and then wrap that inside a timer, without it appearing too contrived.

Does populating a [THP] mmap seem a reasonable way to measure page allocation latency?
-Chris
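A userspace probe along the lines Chris describes might look like the sketch below; this is hypothetical and is not the gem_syslatency source — it simply times the first touch of a THP-eligible mapping while the system is under memory pressure:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (2UL << 20)	/* one 2MiB huge page */

static long ns_now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main(void)
{
	void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	madvise(p, SZ, MADV_HUGEPAGE);	/* ask khugepaged/fault path for a THP */

	long t0 = ns_now();
	memset(p, 1, SZ);		/* fault it in: allocation happens here */
	long t1 = ns_now();

	printf("first-touch latency: %ld us\n", (t1 - t0) / 1000);
	munmap(p, SZ);
	return 0;
}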
On Thu, Jun 25, 2020 at 03:40:44PM +0200, Jan Kara wrote:
> On Thu 25-06-20 12:42:09, Matthew Wilcox wrote:
> > Why are DMA pinned pages still on the LRU list at all? I never got an answer to this that made sense to me. By definition, a page which is pinned for DMA is being accessed, and needs to at the very least change position on the LRU list, so just take it off the list when DMA-pinned and put it back on the list when DMA-unpinned.
>
> Well, we do mark_page_accessed() when pinning in GUP. This is not perfect but it's as good as it gets with the CPU having no control over when the page is actually accessed. Also, taking the page off and then back onto the LRU list would increase contention on the LRU list locks and generally cost performance, so for short term pins it is not desirable... Otherwise I agree that conceptually it would make some sense, although I'm not sure some places wouldn't get confused by e.g. page cache pages not being on the LRU list.

We could/should do what we do for mlocked pages: Documentation/vm/unevictable-lru.rst

I think 'case five' is wrong and needs to be removed. Pinning is inappropriate for "I'm going to modify the page myself".
On Thu, Jun 25, 2020 at 4:42 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.
> >
> > It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].
>
> Why are DMA pinned pages still on the LRU list at all? I never got an answer to this that made sense to me. By definition, a page which is pinned for DMA is being accessed, and needs to at the very least change position on the LRU list, so just take it off the list when DMA-pinned and put it back on the list when DMA-unpinned.

Sounds reasonable to me. In an earlier email I suggested skipping the isolation of dma pinned pages in the scan phase, but if they are long term pinned, it seems preferable to put them on the unevictable lru, IMHO.

> This overly complex lease stuff must have some reason for existing, but I still don't get it.
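Yang's earlier suggestion — refusing to isolate pinned pages during the LRU scan, so reclaim never hands them to shrink_page_list() at all — could look like this hypothetical variant of __isolate_lru_page() in mm/vmscan.c (not a posted patch; the remaining mode checks are elided):

static int isolate_lru_page_sketch(struct page *page, isolate_mode_t mode)
{
	/* Only pages currently on an LRU list may be isolated. */
	if (!PageLRU(page))
		return -EINVAL;

	/*
	 * A DMA-pinned page cannot be freed while pinned; leave it on
	 * the LRU rather than pay for unmap and putback. Moving
	 * long-term pins to the unevictable list, as suggested above,
	 * would need extra plumbing akin to PageMlocked.
	 */
	if (page_maybe_dma_pinned(page))
		return -EBUSY;

	/* ...remaining isolate_mode_t checks elided... */
	return 0;
}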
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fe2dedce1fc..374c6e65551b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1393,6 +1393,22 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
 
+	/*
+	 * Try and fail early to revoke a costly DMA pinned page.
+	 *
+	 * Reclaiming an active DMA page requires stopping the hardware
+	 * and flushing access. [Hardware that does support pagefaulting,
+	 * and so can quickly revoke DMA pages at any time, does not need
+	 * to pin the DMA page.] At worst, the page may be indefinitely in
+	 * use by the hardware. Even at best it will take far longer to
+	 * revoke the access via the mmu notifier, forcing that latency
+	 * onto our callers rather than the consumer of the HW. As we are
+	 * called during opportunistic direct reclaim, declare the
+	 * opportunity cost too high and ignore the page.
+	 */
+	if (page_maybe_dma_pinned(page))
+		return true;
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				       flags & TTU_SPLIT_FREEZE, page);
A general rule of thumb is that shrinkers should be fast and effective. They are called from direct reclaim at the most inconvenient of times when the caller is waiting for a page. If we attempt to reclaim a page being pinned for active dma [pin_user_pages()], we will incur far greater latency than a normal anonymous page mapped multiple times. Worse, the page may be in use indefinitely by the HW and unable to be reclaimed in a timely manner.

A side effect of the LRU shrinker not being dma aware is that we will often attempt to perform direct reclaim on the persistent group of dma pages while continuing to use the dma HW (an issue as the HW may already be actively waiting for the next user request), and even attempt to reclaim a partially allocated dma object in order to satisfy pinning the next user page for that object.

It is to be expected that such pages are made available for reclaim at the end of the dma operation [unpin_user_pages()], and for truly longterm pins to be proactively recovered via device specific shrinkers [i.e. stop the HW, allow the pages to be returned to the system, and then compete again for the memory].

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
---
This seems perhaps a little devious and overzealous. Is there a more appropriate TTU flag? Would there be a way to limit its effect to say FOLL_LONGTERM? Doing the migration first would seem to be sensible if we disable opportunistic migration for the duration of the pin.
---
 mm/rmap.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)