Message ID | 20190814202027.18735-3-daniel.vetter@ffwll.ch (mailing list archive)
State      | New, archived
Series     | hmm & mmu_notifier debug/lockdep annotations
On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > In some special cases we must not block, but there's not a > spinlock, preempt-off, irqs-off or similar critical section already > that arms the might_sleep() debug checks. Add a non_block_start/end() > pair to annotate these. > > This will be used in the oom paths of mmu-notifiers, where blocking is > not allowed to make sure there's forward progress. Quoting Michal: > > "The notifier is called from quite a restricted context - oom_reaper - > which shouldn't depend on any locks or sleepable conditionals. The code > should be swift as well but we mostly do care about it to make a forward > progress. Checking for sleepable context is the best thing we could come > up with that would describe these demands at least partially." > > Peter also asked whether we want to catch spinlocks on top, but Michal > said those are less of a problem because spinlocks can't have an > indirect dependency upon the page allocator and hence close the loop > with the oom reaper. I continue to struggle with this. It introduces a new kernel state "running preemptibly but must not call schedule()". How does this make any sense? Perhaps a much, much more detailed description of the oom_reaper situation would help out.
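For reference, a minimal sketch of what the proposed pair amounts to, based on the patch description quoted above; the task_struct counter name and the WARN placement are assumptions, not the literal patch:

#include <linux/sched.h>

/* Arm might_sleep(): any attempt to schedule() until non_block_end() splats. */
static inline void non_block_start(void)
{
	current->non_block_count++;
}

static inline void non_block_end(void)
{
	WARN_ON(current->non_block_count-- == 0);	/* catch unbalanced pairs */
}

/* ___might_sleep() would then additionally warn whenever current->non_block_count != 0. */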
On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > In some special cases we must not block, but there's not a > spinlock, preempt-off, irqs-off or similar critical section already > that arms the might_sleep() debug checks. Add a non_block_start/end() > pair to annotate these. > > This will be used in the oom paths of mmu-notifiers, where blocking is > not allowed to make sure there's forward progress. Quoting Michal: > > "The notifier is called from quite a restricted context - oom_reaper - > which shouldn't depend on any locks or sleepable conditionals. The code > should be swift as well but we mostly do care about it to make a forward > progress. Checking for sleepable context is the best thing we could come > up with that would describe these demands at least partially." But this describes fs_reclaim_acquire() - is there some reason we are conflating fs_reclaim with non-sleeping? ie is there some fundamental difference between the block stack sleeping during reclaim while it waits for a driver to write out a page and a GPU driver sleeping during OOM while it waits for its HW to fence DMA on a page? Fundamentally we have invalidate_range_start() vs invalidate_range() as the start() version is able to sleep. If drivers can do their work without sleeping then they should be using invalidate_range() instead. Thus, it doesn't seem to make any sense to ask a driver that requires a sleeping API not to sleep. AFAICT what is really going on here is that drivers care about only a subset of the VA space, and we want to query the driver if it cares about the range proposed to be OOM'd, so we can OOM ranges that do not have SPTEs. ie if you look pretty much all drivers do exactly as userptr_mn_invalidate_range_start() does, and bail once they detect the VA range is of interest. So, I'm working on a patch to lift the interval tree into the notifier core and then do the VA test OOM needs without bothering the driver. Drivers can retain the blocking API they require and OOM can work on VA's that don't have SPTEs. This approach also solves the critical bug in this path: https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/ And solves a bunch of other bugs in the drivers. > Peter also asked whether we want to catch spinlocks on top, but Michal > said those are less of a problem because spinlocks can't have an > indirect dependency upon the page allocator and hence close the loop > with the oom reaper. Again, this entirely sounds like fs_reclaim - isn't that exactly what it is for? I have had on my list a second and very related possible bug. I ran into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in advance") which says that mapping->i_mmap_mutex is under fs_reclaim(). We do hold i_mmap_rwsem while calling invalidate_range_start(): unmap_mapping_pages i_mmap_lock_write(mapping); // ie i_mmap_rwsem unmap_mapping_range_tree unmap_mapping_range_vma zap_page_range_single mmu_notifier_invalidate_range_start So, if it is still true that i_mmap_rwsem is under fs_reclaim then invalidate_range_start is *always* under fs_reclaim anyhow! (this I do not know) Thus we should use lockdep to force this and fix all the drivers. .. and if we force fs_reclaim always, do we care about blockable anymore?? Jason
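A hedged sketch of the bail-out pattern Jason describes, loosely modeled on what userptr_mn_invalidate_range_start() does; struct my_mirror and its va_tree are illustrative names, not a real driver:

#include <linux/mmu_notifier.h>
#include <linux/interval_tree.h>

struct my_mirror {
	struct mmu_notifier mn;
	struct rb_root_cached va_tree;	/* VA ranges the device actually mirrors */
};

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_mirror *m = container_of(mn, struct my_mirror, mn);

	/* Cheap, non-sleeping check: does the range overlap anything we mirror? */
	if (!interval_tree_iter_first(&m->va_tree, range->start, range->end - 1))
		return 0;		/* no interest, succeed immediately */

	if (!mmu_notifier_range_blockable(range))
		return -EAGAIN;		/* we would have to sleep; tell OOM to back off */

	/* ... sleeping teardown of the device mappings goes here ... */
	return 0;
}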
On Wed, Aug 14, 2019 at 01:45:58PM -0700, Andrew Morton wrote: > On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > > In some special cases we must not block, but there's not a > > spinlock, preempt-off, irqs-off or similar critical section already > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > pair to annotate these. > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > not allowed to make sure there's forward progress. Quoting Michal: > > > > "The notifier is called from quite a restricted context - oom_reaper - > > which shouldn't depend on any locks or sleepable conditionals. The code > > should be swift as well but we mostly do care about it to make a forward > > progress. Checking for sleepable context is the best thing we could come > > up with that would describe these demands at least partially." > > > > Peter also asked whether we want to catch spinlocks on top, but Michal > > said those are less of a problem because spinlocks can't have an > > indirect dependency upon the page allocator and hence close the loop > > with the oom reaper. > > I continue to struggle with this. It introduces a new kernel state > "running preemptibly but must not call schedule()". How does this make > any sense? > > Perhaps a much, much more detailed description of the oom_reaper > situation would help out. I agree on both points, but I guess I'm not the expert to explain why we have this. All I'm trying to do is make sure that drivers hold up their side. If you want to have better documentation for why the oom case needs this special new mode, you're looking at the wrong guy for that :-) Of course if you folks all decide that you just don't want to be reminded of that I guess we can drop this one here, but you're just shooting the messenger I think. -Daniel
On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > In some special cases we must not block, but there's not a > > spinlock, preempt-off, irqs-off or similar critical section already > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > pair to annotate these. > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > not allowed to make sure there's forward progress. Quoting Michal: > > > > "The notifier is called from quite a restricted context - oom_reaper - > > which shouldn't depend on any locks or sleepable conditionals. The code > > should be swift as well but we mostly do care about it to make a forward > > progress. Checking for sleepable context is the best thing we could come > > up with that would describe these demands at least partially." > > But this describes fs_reclaim_acquire() - is there some reason we are > conflating fs_reclaim with non-sleeping? No idea why you tie this into fs_reclaim. We can definitely sleep in there, and for e.g. kswapd (which also wraps everything in fs_reclaim) we're even supposed to, I thought. To make sure we can get at the last bit of memory by flushing all the queues and waiting for everything to be cleaned out. > ie is there some fundamental difference between the block stack > sleeping during reclaim while it waits for a driver to write out a > page and a GPU driver sleeping during OOM while it waits for it's HW > to fence DMA on a page? > > Fundamentally we have invalidate_range_start() vs invalidate_range() > as the start() version is able to sleep. If drivers can do their work > without sleeping then they should be using invalidare_range() instead. > > Thus, it doesn't seem to make any sense to ask a driver that requires a > sleeping API not to sleep. > > AFAICT what is really going on here is that drivers care about only a > subset of the VA space, and we want to query the driver if it cares > about the range proposed to be OOM'd, so we can OOM ranges that are > do not have SPTEs. > > ie if you look pretty much all drivers do exactly as > userptr_mn_invalidate_range_start() does, and bail once they detect > the VA range is of interest. > > So, I'm working on a patch to lift the interval tree into the notifier > core and then do the VA test OOM needs without bothering the > driver. Drivers can retain the blocking API they require and OOM can > work on VA's that don't have SPTEs. Hm I figured the point of hmm_mirror is to have that interval tree for everyone (among other things). But yeah lifting to mmu_notifier sounds like a clean solution for this, but I really don't have much clue about why we even have this special mode in the oom case. I'm just trying to increase the odds that drivers hold up their end of the bargain. > This approach also solves the critical bug in this path: > https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/ > > And solves a bunch of other bugs in the drivers. > > > Peter also asked whether we want to catch spinlocks on top, but Michal > > said those are less of a problem because spinlocks can't have an > > indirect dependency upon the page allocator and hence close the loop > > with the oom reaper. > > Again, this entirely sounds like fs_reclaim - isn't that exactly what > it is for? > > I have had on my list a second and very related possible bug. I ran > into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in > advance") which says that mapping->i_mmap_mutex is under fs_reclaim(). > > We do hold i_mmap_rwsem while calling invalidate_range_start(): > > unmap_mapping_pages > i_mmap_lock_write(mapping); // ie i_mmap_rwsem > unmap_mapping_range_tree > unmap_mapping_range_vma > zap_page_range_single > mmu_notifier_invalidate_range_start > > So, if it is still true that i_mmap_rwsem is under fs_reclaim then > invalidate_range_start is *always* under fs_reclaim anyhow! (this I do > not know) > > Thus we should use lockdep to force this and fix all the drivers. > > .. and if we force fs_reclaim always, do we care about blockable > anymore?? Still not sure how fs_reclaim matters here. One of the later patches steals the fs_reclaim idea for mmu_notifiers, but that ties together completely different paths. -Daniel
On Wed 14-08-19 13:45:58, Andrew Morton wrote: > On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > > In some special cases we must not block, but there's not a > > spinlock, preempt-off, irqs-off or similar critical section already > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > pair to annotate these. > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > not allowed to make sure there's forward progress. Quoting Michal: > > > > "The notifier is called from quite a restricted context - oom_reaper - > > which shouldn't depend on any locks or sleepable conditionals. The code > > should be swift as well but we mostly do care about it to make a forward > > progress. Checking for sleepable context is the best thing we could come > > up with that would describe these demands at least partially." > > > > Peter also asked whether we want to catch spinlocks on top, but Michal > > said those are less of a problem because spinlocks can't have an > > indirect dependency upon the page allocator and hence close the loop > > with the oom reaper. > > I continue to struggle with this. It introduces a new kernel state > "running preemptibly but must not call schedule()". How does this make > any sense? > > Perhaps a much, much more detailed description of the oom_reaper > situation would help out. The primary point here is that there is a demand for non-blockable mmu notifiers to be called when the oom reaper tears down the address space. As the oom reaper is the primary guarantee of the oom handling forward progress it cannot be blocked on anything that might depend on blockable memory allocations. These are not really easy to track because they might be indirect - e.g. a notifier blocks on a lock which another context holds while allocating memory or waiting for a flusher that needs memory to perform its work. If such a blocking state happens, we can end up in a silent hang with an unusable machine. Now we hope for reasonable implementations of mmu notifiers (strong words I know ;) and this should be a relatively simple and effective catch-all tool to detect that something suspicious is going on. Does that make the situation more clear?
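Roughly, this is the call site being discussed - paraphrased from __oom_reap_task_mm() in mm/oom_kill.c of that era, so details may differ: the _nonblock variant passes !blockable down to every notifier, and the reaper backs off from the vma if any notifier cannot comply:

struct mmu_notifier_range range;
struct mmu_gather tlb;

mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm,
			vma->vm_start, vma->vm_end);
tlb_gather_mmu(&tlb, mm, range.start, range.end);
if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
	/* A notifier would have to block: give up, retry this mm later. */
	tlb_finish_mmu(&tlb, range.start, range.end);
	ret = false;
	continue;
}
unmap_page_range(&tlb, vma, range.start, range.end, NULL);
mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb, range.start, range.end);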
On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote: > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > > In some special cases we must not block, but there's not a > > > spinlock, preempt-off, irqs-off or similar critical section already > > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > > pair to annotate these. > > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > > not allowed to make sure there's forward progress. Quoting Michal: > > > > > > "The notifier is called from quite a restricted context - oom_reaper - > > > which shouldn't depend on any locks or sleepable conditionals. The code > > > should be swift as well but we mostly do care about it to make a forward > > > progress. Checking for sleepable context is the best thing we could come > > > up with that would describe these demands at least partially." > > > > But this describes fs_reclaim_acquire() - is there some reason we are > > conflating fs_reclaim with non-sleeping? > > No idea why you tie this into fs_reclaim. We can definitly sleep in there, > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're > event supposed to I thought. To make sure we can get at the last bit of > memory by flushing all the queues and waiting for everything to be cleaned > out. AFAIK the point of fs_reclaim is to prevent "indirect dependency upon the page allocator" ie a justification that was given for this !blockable stuff. For instance: fs_reclaim_acquire() kmalloc(GFP_KERNEL) <- lock dep assertion And further, Michal's concern about indirectness through locks is also handled by lockdep: CPU0 CPU1 mutex_lock() kmalloc(GFP_KERNEL) mutex_unlock() fs_reclaim_acquire() mutex_lock() <- lock dep assertion In other words, to prevent recursion into the page allocator you use fs_reclaim_acquire(), and lockdep verifies it in its usual robust way. I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he explained that it means you can't call the allocator functions in a way that would recurse into reclaim (ie instead use GFP_ATOMIC, or tolerate allocation failure, or various other things). So, the reason I bring it up is half the justifications you posted for blockable had to do with not recursing into reclaim and deadlocking, and didn't seem to have much to do with blocking. I'm asking if *non-blocking* is really the requirement or if this is just the usual 'do not deadlock on the allocator' thing reclaim paths already have? Jason
On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > As the oom reaper is the primary guarantee of the oom handling forward > progress it cannot be blocked on anything that might depend on blockable > memory allocations. These are not really easy to track because they > might be indirect - e.g. notifier blocks on a lock which other context > holds while allocating memory or waiting for a flusher that needs memory > to perform its work. But lockdep *does* track all this and fs_reclaim_acquire() was created to solve exactly this problem. fs_reclaim is a lock and it flows through all the usual lockdep schemes like any other lock. Any time the page allocator wants to do something that would deadlock with reclaim it takes the lock. Failure is expressed by a deadlock cycle in the lockdep map, and lockdep can handle arbitrary complexity through layers of locks, work queues, threads, etc. What is missing? Jason
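To illustrate the kind of cycle the fs_reclaim map catches; drv->lock is a hypothetical driver mutex, not a real API:

/* Thread A: driver takes its lock, then does a sleeping allocation. */
mutex_lock(&drv->lock);
p = kmalloc(size, GFP_KERNEL);	/* lockdep records drv->lock -> fs_reclaim */
mutex_unlock(&drv->lock);

/* Thread B: direct reclaim calls back into the same driver. */
fs_reclaim_acquire(GFP_KERNEL);	/* entering reclaim */
mutex_lock(&drv->lock);		/* fs_reclaim -> drv->lock: ABBA cycle, lockdep splats */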
On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > > > > As the oom reaper is the primary guarantee of the oom handling forward > > progress it cannot be blocked on anything that might depend on blockable > > memory allocations. These are not really easy to track because they > > might be indirect - e.g. notifier blocks on a lock which other context > > holds while allocating memory or waiting for a flusher that needs memory > > to perform its work. > > But lockdep *does* track all this and fs_reclaim_acquire() was created > to solve exactly this problem. > > fs_reclaim is a lock and it flows through all the usual lockdep > schemes like any other lock. Any time the page allocator wants to do > something the would deadlock with reclaim it takes the lock. > > Failure is expressed by a deadlock cycle in the lockdep map, and > lockdep can handle arbitary complexity through layers of locks, work > queues, threads, etc. > > What is missing? Lockdep doesn't see everything, by far. E.g. a wait_event will be caught by the annotations here, but not by lockdep. Plus lockdep does not see through the wait_event, and doesn't realize when e.g. that event will never signal because the worker is part of the deadlock loop. cross-release was supposed to fix that, but seems like that will never land. And since we're talking about mmu notifiers here and gpus/dma engines: we have dma_fence_wait, which can wait for any hw/driver in the system that takes part in shared/zero-copy buffer processing. Which at least on the graphics side is everything. This pulls in enormous amounts of deadlock potential that lockdep simply is blind about and will never see. Arming might_sleep catches them all. Cheers, Daniel PS: Don't ask me why we need these semantics for oom_reaper, like I said, I'm just trying to follow the rules.
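An illustration (all names hypothetical) of the blindness Daniel describes: nothing below is a lock acquisition, so lockdep records no dependency, yet the two sides can deadlock; a might_sleep()-based annotation at least catches that the notifier sleeps at all:

/* mmu notifier, reached from reclaim: wait for the GPU job to finish. */
wait_event(drv->idle_wq, dma_fence_is_signaled(job->fence));

/* worker that would signal the fence, itself stuck allocating: */
buf = kmalloc(size, GFP_KERNEL);	/* may recurse into reclaim -> notifier above */
dma_fence_signal(job->fence);		/* never reached: silent hang, no lockdep report */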
On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote: > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > > > In some special cases we must not block, but there's not a > > > > spinlock, preempt-off, irqs-off or similar critical section already > > > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > > > pair to annotate these. > > > > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > > > not allowed to make sure there's forward progress. Quoting Michal: > > > > > > > > "The notifier is called from quite a restricted context - oom_reaper - > > > > which shouldn't depend on any locks or sleepable conditionals. The code > > > > should be swift as well but we mostly do care about it to make a forward > > > > progress. Checking for sleepable context is the best thing we could come > > > > up with that would describe these demands at least partially." > > > > > > But this describes fs_reclaim_acquire() - is there some reason we are > > > conflating fs_reclaim with non-sleeping? > > > > No idea why you tie this into fs_reclaim. We can definitly sleep in there, > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're > > event supposed to I thought. To make sure we can get at the last bit of > > memory by flushing all the queues and waiting for everything to be cleaned > > out. > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon > the page allocator" ie a justification that was given this !blockable > stuff. > > For instance: > > fs_reclaim_acquire() > kmalloc(GFP_KERNEL) <- lock dep assertion > > And further, Michal's concern about indirectness through locks is also > handled by lockdep: > > CPU0 CPU1 > mutex_lock() > kmalloc(GFP_KERNEL) > mutex_unlock() > fs_reclaim_acquire() > mutex_lock() <- lock dep assertion > > In other words, to prevent recursion into the page allocator you use > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way. fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about any !GFP_NOWAIT allocation context here and any {in}direct dependency on it. Whether fs_reclaim_acquire can be reused for that I do not know because I am not familiar with the lockdep machinery enough > I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he > explained that it means you can't call the allocator functions in a > way that would recurse into reclaim (ie instead use instead GFP_ATOMIC, or > tolerate allocation failure, or various other things). > > So, the reason I bring it up is half the justifications you posted for > blockable had to do with not recursing into reclaim and deadlocking, > and didn't seem to have much to do with blocking. > > I'm asking if *non-blocking* is really the requirement or if this is > just the usual 'do not deadlock on the allocator' thing reclaim paths > alread have? No, non-blocking is a very coarse approximation of what we really need. But it should give us even a stronger condition. Essentially any sleep other than a preemption shouldn't be allowed in that context.
On Thu 15-08-19 10:04:15, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > > > As the oom reaper is the primary guarantee of the oom handling forward > > progress it cannot be blocked on anything that might depend on blockable > > memory allocations. These are not really easy to track because they > > might be indirect - e.g. notifier blocks on a lock which other context > > holds while allocating memory or waiting for a flusher that needs memory > > to perform its work. > > But lockdep *does* track all this and fs_reclaim_acquire() was created > to solve exactly this problem. > > fs_reclaim is a lock and it flows through all the usual lockdep > schemes like any other lock. Any time the page allocator wants to do > something the would deadlock with reclaim it takes the lock. Our emails have crossed. Even if fs_reclaim can be re-purposed for contexts other than FS/IO reclaim, which I am not sure about, I think that lockdep is too heavyweight for the purpose of this debugging aid in the first place. The non-block mode is really something as simple as "hey mmu notifier you are only allowed to do a lightweight processing, if you cannot guarantee that then just back off". The annotation will give us a warning if the notifier breaks this promise.
On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote: > On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote: > > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > > > > In some special cases we must not block, but there's not a > > > > > spinlock, preempt-off, irqs-off or similar critical section already > > > > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > > > > pair to annotate these. > > > > > > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > > > > not allowed to make sure there's forward progress. Quoting Michal: > > > > > > > > > > "The notifier is called from quite a restricted context - oom_reaper - > > > > > which shouldn't depend on any locks or sleepable conditionals. The code > > > > > should be swift as well but we mostly do care about it to make a forward > > > > > progress. Checking for sleepable context is the best thing we could come > > > > > up with that would describe these demands at least partially." > > > > > > > > But this describes fs_reclaim_acquire() - is there some reason we are > > > > conflating fs_reclaim with non-sleeping? > > > > > > No idea why you tie this into fs_reclaim. We can definitly sleep in there, > > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're > > > event supposed to I thought. To make sure we can get at the last bit of > > > memory by flushing all the queues and waiting for everything to be cleaned > > > out. > > > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon > > the page allocator" ie a justification that was given this !blockable > > stuff. > > > > For instance: > > > > fs_reclaim_acquire() > > kmalloc(GFP_KERNEL) <- lock dep assertion > > > > And further, Michal's concern about indirectness through locks is also > > handled by lockdep: > > > > CPU0 CPU1 > > mutex_lock() > > kmalloc(GFP_KERNEL) > > mutex_unlock() > > fs_reclaim_acquire() > > mutex_lock() <- lock dep assertion > > > > In other words, to prevent recursion into the page allocator you use > > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way. > > fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about > any !GFP_NOWAIT allocation context here and any {in}direct dependency on > it. AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and __GFP_DIRECT_RECLAIM.. This matches the existing test in __need_fs_reclaim() - so if you are OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), allocations during OOM, then I think fs_reclaim already matches what you described? > Whether fs_reclaim_acquire can be reused for that I do not know > because I am not familiar with the lockdep machinery enough Well, if fs_reclaim is not already testing the flags you want, then we could add another lockdep map that does. The basic principle is the same: if you want to detect and prevent recursion into the allocator under certain GFP flags, then AFAIK lockdep is the best tool we have. > No, non-blocking is a very coarse approximation of what we really need. > But it should give us even a stronger condition. Essentially any sleep > other than a preemption shouldn't be allowed in that context. But it is a nonsense API to give the driver invalidate_range_start, the blocking alternative to the non-blocking invalidate_range, and then demand it to be non-blocking. Inspecting the code, no drivers are actually able to progress their side in non-blocking mode. The best we got was drivers that tested the VA range and returned success if they had no interest. Which is a big win to be sure, but it looks like getting any more is not really possible. However, we could (probably even should) make the drivers fs_reclaim safe. If that is enough to guarantee progress of OOM, then let's consider something like using current_gfp_context() to force PF_MEMALLOC_NOFS allocation behavior on the driver callback and lockdep to try and keep pushing on the debugging, and dropping !blocking. Jason
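A sketch of the direction proposed here, assuming the existing memalloc_nofs_save()/memalloc_noio_save() primitives (the machinery behind current_gfp_context()) are the right tool; whether the notifier core would wrap the callback exactly like this is an open question:

#include <linux/sched/mm.h>

unsigned int nofs, noio;
int ret;

nofs = memalloc_nofs_save();	/* demote any GFP_KERNEL the driver does to GFP_NOFS... */
noio = memalloc_noio_save();	/* ...and further to GFP_NOIO */
ret = mn->ops->invalidate_range_start(mn, range);
memalloc_noio_restore(noio);
memalloc_nofs_restore(nofs);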
On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote: > On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > > > > > > As the oom reaper is the primary guarantee of the oom handling forward > > > progress it cannot be blocked on anything that might depend on blockable > > > memory allocations. These are not really easy to track because they > > > might be indirect - e.g. notifier blocks on a lock which other context > > > holds while allocating memory or waiting for a flusher that needs memory > > > to perform its work. > > > > But lockdep *does* track all this and fs_reclaim_acquire() was created > > to solve exactly this problem. > > > > fs_reclaim is a lock and it flows through all the usual lockdep > > schemes like any other lock. Any time the page allocator wants to do > > something the would deadlock with reclaim it takes the lock. > > > > Failure is expressed by a deadlock cycle in the lockdep map, and > > lockdep can handle arbitary complexity through layers of locks, work > > queues, threads, etc. > > > > What is missing? > > Lockdep doens't seen everything by far. E.g. a wait_event will be > caught by the annotations here, but not by lockdep. Sure, but the wait_event might be OK if its progress isn't contingent on fs_reclaim, ie triggered from interrupt, so why ban it? > And since we're talking about mmu notifiers here and gpus/dma engines. > We have dma_fence_wait, which can wait for any hw/driver in the system > that takes part in shared/zero-copy buffer processing. Which at least > on the graphics side is everything. This pulls in enormous amounts of > deadlock potential that lockdep simply is blind about and will never > see. It seems very risky to entangle a notifier widely like that. It looks pretty sure that notifiers are fs_reclaim, so at a minimum that wait_event can't be contingent on anything that is doing GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep safe. Avoiding an uncertain wait_event under notifiers would seem to be the only reasonable design here.. Jason
On Thu, Aug 15, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote: > > On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > > > > > > > As the oom reaper is the primary guarantee of the oom handling forward > > > > progress it cannot be blocked on anything that might depend on blockable > > > > memory allocations. These are not really easy to track because they > > > > might be indirect - e.g. notifier blocks on a lock which other context > > > > holds while allocating memory or waiting for a flusher that needs memory > > > > to perform its work. > > > > > > But lockdep *does* track all this and fs_reclaim_acquire() was created > > > to solve exactly this problem. > > > > > > fs_reclaim is a lock and it flows through all the usual lockdep > > > schemes like any other lock. Any time the page allocator wants to do > > > something the would deadlock with reclaim it takes the lock. > > > > > > Failure is expressed by a deadlock cycle in the lockdep map, and > > > lockdep can handle arbitary complexity through layers of locks, work > > > queues, threads, etc. > > > > > > What is missing? > > > > Lockdep doens't seen everything by far. E.g. a wait_event will be > > caught by the annotations here, but not by lockdep. > > Sure, but the wait_event might be OK if its progress isn't contingent > on fs_reclaim, ie triggered from interrupt, so why ban it? For normal notifiers sure (but lockdep won't help you prove that at all). For oom_reaper apparently not, but that's really Michal's thing, I don't have much idea about that. > > And since we're talking about mmu notifiers here and gpus/dma engines. > > We have dma_fence_wait, which can wait for any hw/driver in the system > > that takes part in shared/zero-copy buffer processing. Which at least > > on the graphics side is everything. This pulls in enormous amounts of > > deadlock potential that lockdep simply is blind about and will never > > see. > > It seems very risky to entagle a notifier widely like that. That's why I've looked into all possible ways to annotate them with debug infrastructure. > It looks pretty sure that notifiers are fs_reclaim, so at a minimum > that wait_event can't be contingent on anything that is doing > GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep > safe. > > Avoiding an uncertain wait_event under notifiers would seem to be the > only reasonable design here.. You have to wait for the gpu to finish current processing in invalidate_range_start. Otherwise there's no point to any of this really. So the wait_event/dma_fence_wait are unavoidable really. That's also why I'm throwing in the lockdep annotation on top, and why it would be really nice if we somehow can get the cross-release work landed. But it looks like all the people who could make it happen are busy with smeltdown :-/ -Daniel
On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > You have to wait for the gpu to finnish current processing in > invalidate_range_start. Otherwise there's no point to any of this > really. So the wait_event/dma_fence_wait are unavoidable really. I don't envy your task :| But, what you describe sure sounds like a 'registration cache' model, not the 'shadow pte' model of coherency. The key difference is that a registration cache is allowed to become incoherent with the VMAs because it holds page pins. It is a programming bug in userspace to change VA mappings via mmap/munmap/etc while the device is working on that VA, but it does not harm system integrity because of the page pin. The cache ensures that each initiated operation sees a DMA setup that matches the current VA map when the operation is initiated and allows expensive device DMA setups to be re-used. A 'shadow pte' model (ie hmm) *really* needs device support to directly block DMA access - ie trigger 'device page fault'. ie the invalidate_start should inform the device to enter a fault mode and that is it. If the device can't do that, then the driver probably shouldn't pursue this level of coherency. The driver would quickly get into the messy locking problems like dma_fence_wait from a notifier. It is important to identify what model you are going for as defining a 'registration cache' coherence expectation allows the driver to skip blocking in invalidate_range_start. All it does is invalidate the cache so that future operations pick up the new VA mapping. Intel's HFI RDMA driver uses this model extensively, and I think it is well proven, within some limitations of course. At least, 'registration cache' is the only use model I know of where it is acceptable to skip invalidate_range_end. Jason
On Thu 15-08-19 11:12:19, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote: > > On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote: > > > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > > > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > > > > > In some special cases we must not block, but there's not a > > > > > > spinlock, preempt-off, irqs-off or similar critical section already > > > > > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > > > > > pair to annotate these. > > > > > > > > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > > > > > not allowed to make sure there's forward progress. Quoting Michal: > > > > > > > > > > > > "The notifier is called from quite a restricted context - oom_reaper - > > > > > > which shouldn't depend on any locks or sleepable conditionals. The code > > > > > > should be swift as well but we mostly do care about it to make a forward > > > > > > progress. Checking for sleepable context is the best thing we could come > > > > > > up with that would describe these demands at least partially." > > > > > > > > > > But this describes fs_reclaim_acquire() - is there some reason we are > > > > > conflating fs_reclaim with non-sleeping? > > > > > > > > No idea why you tie this into fs_reclaim. We can definitly sleep in there, > > > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're > > > > event supposed to I thought. To make sure we can get at the last bit of > > > > memory by flushing all the queues and waiting for everything to be cleaned > > > > out. > > > > > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon > > > the page allocator" ie a justification that was given this !blockable > > > stuff. > > > > > > For instance: > > > > > > fs_reclaim_acquire() > > > kmalloc(GFP_KERNEL) <- lock dep assertion > > > > > > And further, Michal's concern about indirectness through locks is also > > > handled by lockdep: > > > > > > CPU0 CPU1 > > > mutex_lock() > > > kmalloc(GFP_KERNEL) > > > mutex_unlock() > > > fs_reclaim_acquire() > > > mutex_lock() <- lock dep assertion > > > > > > In other words, to prevent recursion into the page allocator you use > > > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way. > > > > fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about > > any !GFP_NOWAIT allocation context here and any {in}direct dependency on > > it. > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > __GFP_DIRECT_RECLAIM.. > > This matches the existing test in __need_fs_reclaim() - so if you are > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > allocations during OOM, then I think fs_reclaim already matches what > you described? No, GFP_NOFS is equally bad. Please read my other email explaining what the oom_reaper actually requires. In short no blocking on direct or indirect dependency on memory allocation that might sleep. If you can express that in the existing lockdep machinery, all fine. But then consider deployments where lockdep is a no-no because of the overhead. > > No, non-blocking is a very coarse approximation of what we really need. > > But it should give us even a stronger condition. Essentially any sleep > > other than a preemption shouldn't be allowed in that context. > But it is a nonsense API to give the driver invalidate_range_start, > the blocking alternative to the non-blocking invalidate_range and then > demand it to be non-blocking. > > Inspecting the code, no drivers are actually able to progress their > side in non-blocking mode. > > The best we got was drivers tested the VA range and returned success > if they had no interest. Which is a big win to be sure, but it looks > like getting any more is not really posssible. And that is already a great win! Because many notifiers only do care about particular mappings. Please note that backing off unconditionally will simply cause the oom reaper to back off, not tearing anything down. > However, we could (probably even should) make the drivers fs_reclaim > safe. > > If that is enough to guarantee progress of OOM, then lets consider > something like using current_gfp_context() to force PF_MEMALLOC_NOFS > allocation behavior on the driver callback and lockdep to try and keep > pushing on the the debugging, and dropping !blocking. How are you going to enforce an indirect dependency? E.g. a lock that is also used in another context which depends on sleepable memory allocation to move forward. Really, this was aimed at giving a very simple debugging aid. If it is considered to be so controversial, even though I really do not see why, then let's just drop it on the floor.
On Thu, Aug 15, 2019 at 5:10 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > You have to wait for the gpu to finnish current processing in > > invalidate_range_start. Otherwise there's no point to any of this > > really. So the wait_event/dma_fence_wait are unavoidable really. > > I don't envy your task :| > > But, what you describe sure sounds like a 'registration cache' model, > not the 'shadow pte' model of coherency. > > The key difference is that a regirstationcache is allowed to become > incoherent with the VMA's because it holds page pins. It is a > programming bug in userspace to change VA mappings via mmap/munmap/etc > while the device is working on that VA, but it does not harm system > integrity because of the page pin. > > The cache ensures that each initiated operation sees a DMA setup that > matches the current VA map when the operation is initiated and allows > expensive device DMA setups to be re-used. > > A 'shadow pte' model (ie hmm) *really* needs device support to > directly block DMA access - ie trigger 'device page fault'. ie the > invalidate_start should inform the device to enter a fault mode and > that is it. If the device can't do that, then the driver probably > shouldn't persue this level of coherency. The driver would quickly get > into the messy locking problems like dma_fence_wait from a notifier. > > It is important to identify what model you are going for as defining a > 'registration cache' coherence expectation allows the driver to skip > blocking in invalidate_range_start. All it does is invalidate the > cache so that future operations pick up the new VA mapping. > > Intel's HFI RDMA driver uses this model extensively, and I think it is > well proven, within some limitations of course. > > At least, 'registration cache' is the only use model I know of where > it is acceptable to skip invalidate_range_end. I'm not really well versed in the details of our userptr, but both amdgpu and i915 wait for the gpu to complete from invalidate_range_start. Jerome has at least looked a lot at the amdgpu one, so maybe he can explain what exactly it is we're doing ... -Daniel
On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > You have to wait for the gpu to finnish current processing in > > invalidate_range_start. Otherwise there's no point to any of this > > really. So the wait_event/dma_fence_wait are unavoidable really. > > I don't envy your task :| > > But, what you describe sure sounds like a 'registration cache' model, > not the 'shadow pte' model of coherency. > > The key difference is that a regirstationcache is allowed to become > incoherent with the VMA's because it holds page pins. It is a > programming bug in userspace to change VA mappings via mmap/munmap/etc > while the device is working on that VA, but it does not harm system > integrity because of the page pin. > > The cache ensures that each initiated operation sees a DMA setup that > matches the current VA map when the operation is initiated and allows > expensive device DMA setups to be re-used. > > A 'shadow pte' model (ie hmm) *really* needs device support to > directly block DMA access - ie trigger 'device page fault'. ie the > invalidate_start should inform the device to enter a fault mode and > that is it. If the device can't do that, then the driver probably > shouldn't persue this level of coherency. The driver would quickly get > into the messy locking problems like dma_fence_wait from a notifier. I think here we do not agree on the hardware requirement. For GPU we will always need to be able to wait for some GPU fence from inside the notifier callback; there is just no way around that for many of the GPUs today (I do not see any indication of that changing). Drivers should avoid lock complexity by using a wait queue so that the driver notifier callback can wait without having to hold some driver lock. However there will be at least one lock needed to update the internal driver state for the range being invalidated. That lock is just the device driver page table lock for the GPU page table associated with the mm_struct. In all GPU drivers so far it is a short-lived lock and nothing blocking is done while holding it (it is really just about updating the page table directory, whether filling it or clearing it). > > It is important to identify what model you are going for as defining a > 'registration cache' coherence expectation allows the driver to skip > blocking in invalidate_range_start. All it does is invalidate the > cache so that future operations pick up the new VA mapping. > > Intel's HFI RDMA driver uses this model extensively, and I think it is > well proven, within some limitations of course. > > At least, 'registration cache' is the only use model I know of where > it is acceptable to skip invalidate_range_end. Here GPUs are not in the registration cache model; I know it might look like it because of GUP, but GUP was used just because hmm did not exist at the time. Cheers, Jérôme
On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > __GFP_DIRECT_RECLAIM.. > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > allocations during OOM, then I think fs_reclaim already matches what > > you described? > > No GFP_NOFS is equally bad. Please read my other email explaining what > the oom_reaper actually requires. In short no blocking on direct or > indirect dependecy on memory allocation that might sleep. It is much easier to follow with some hints on code, so the true requirement is that the OOM reaper not block on GFP_FS and GFP_IO allocations, great, that constraint is now clear. > If you can express that in the existing lockdep machinery. All > fine. But then consider deployments where lockdep is no-no because > of the overhead. This is all for driver debugging. The point of lockdep is to find all these paths without having to hit them as actual races, using debug kernels. I don't think we need this kind of debugging on production kernels? > > The best we got was drivers tested the VA range and returned success > > if they had no interest. Which is a big win to be sure, but it looks > > like getting any more is not really posssible. > > And that is already a great win! Because many notifiers only do care > about particular mappings. Please note that backing off unconditioanlly > will simply cause that the oom reaper will have to back off not doing > any tear down anything. Well, I'm working to propose that we do the VA range test under core mmu notifier code that cannot block and then we simply remove the idea of blockable from drivers using this new 'range notifier'. I think this pretty much solves the concern? > > However, we could (probably even should) make the drivers fs_reclaim > > safe. > > > > If that is enough to guarantee progress of OOM, then lets consider > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS > > allocation behavior on the driver callback and lockdep to try and keep > > pushing on the the debugging, and dropping !blocking. > > How are you going to enforce indirect dependency? E.g. a lock that is > also used in other context which depend on sleepable memory allocation > to move forward. You mean like this: CPU0 CPU1 mutex_lock() kmalloc(GFP_KERNEL) mutex_unlock() fs_reclaim_acquire() mutex_lock() <- illegal: lock dep assertion ? lockdep handles this - that is what it does, it builds a graph of all lock dependencies and requires the graph to be acyclic. So mutex_lock depends on fs_reclaim_lock on CPU1 while on CPU0, fs_reclaim_lock depends on mutex_lock. This is an ABBA locking cycle and lockdep will reliably trigger. So, if we wanted to do this, I'd probably suggest we retool fs_reclaim's interface to be more like prevent_gfp_flags(__GFP_IO | __GFP_FS); [..] restore_gfp_flags(__GFP_IO | __GFP_FS); Which is lockdep based and follows the indirect lock dependencies you talked about. Then OOM and reclaim can specify the different flags they want blocked. We could probably use the same API with WQ_MEM_RECLAIM as I was chatting with Tejun about: https://www.spinics.net/lists/linux-rdma/msg77362.html IMHO this stuff is *super complicated* for those of us outside the mm subsystem, so having some really clear annotations like the above would go a long way to help understand these special constraints. I'm pretty sure there are lots of driver bugs related to using the wrong GFP flags in the kernel. > Really, this was aimed to give a very simple debugging aid. If it is > considered to be so controversial, even though I really do not see why, > then let's just drop it on the floor. My concern is that the requirement was very unclear. I'm trying to understand all the bits of how these notifiers work, and the exact semantics of this OOM path have been vague - whether it is really some GFP flag restriction or truly related to not sleeping. Jason
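The interface Jason sketches does not exist; purely as an illustration, the hypothetical pair might let different reclaim contexts declare different constraints through the same lockdep machinery:

/* direct reclaim: forbid recursion into FS/IO only */
prevent_gfp_flags(__GFP_FS | __GFP_IO);
/* ... run shrinkers, notifiers ... */
restore_gfp_flags(__GFP_FS | __GFP_IO);

/* oom_reaper: forbid any allocation that can sleep at all */
prevent_gfp_flags(__GFP_DIRECT_RECLAIM);
mmu_notifier_invalidate_range_start(&range);
restore_gfp_flags(__GFP_DIRECT_RECLAIM);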
On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > __GFP_DIRECT_RECLAIM.. > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > allocations during OOM, then I think fs_reclaim already matches what > > > you described? > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > the oom_reaper actually requires. In short no blocking on direct or > > indirect dependecy on memory allocation that might sleep. > > It is much easier to follow with some hints on code, so the true > requirement is that the OOM repear not block on GFP_FS and GFP_IO > allocations, great, that constraint is now clear. > > > If you can express that in the existing lockdep machinery. All > > fine. But then consider deployments where lockdep is no-no because > > of the overhead. > > This is all for driver debugging. The point of lockdep is to find all > these paths without have to hit them as actual races, using debug > kernels. > > I don't think we need this kind of debugging on production kernels? > > > > The best we got was drivers tested the VA range and returned success > > > if they had no interest. Which is a big win to be sure, but it looks > > > like getting any more is not really posssible. > > > > And that is already a great win! Because many notifiers only do care > > about particular mappings. Please note that backing off unconditioanlly > > will simply cause that the oom reaper will have to back off not doing > > any tear down anything. > > Well, I'm working to propose that we do the VA range test under core > mmu notifier code that cannot block and then we simply remove the idea > of blockable from drivers using this new 'range notifier'. > > I think this pretty much solves the concern? I am not sure I follow what you propose here. Like I pointed out in another email, for GPU we do need to be able to sleep (we might get lucky and not need to, but this is a runtime thing) within the notifier range_start callback. This is something notifiers have allowed since they were introduced in the kernel. Cheers, Jérôme
On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote: > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > > > You have to wait for the gpu to finnish current processing in > > > invalidate_range_start. Otherwise there's no point to any of this > > > really. So the wait_event/dma_fence_wait are unavoidable really. > > > > I don't envy your task :| > > > > But, what you describe sure sounds like a 'registration cache' model, > > not the 'shadow pte' model of coherency. > > > > The key difference is that a regirstationcache is allowed to become > > incoherent with the VMA's because it holds page pins. It is a > > programming bug in userspace to change VA mappings via mmap/munmap/etc > > while the device is working on that VA, but it does not harm system > > integrity because of the page pin. > > > > The cache ensures that each initiated operation sees a DMA setup that > > matches the current VA map when the operation is initiated and allows > > expensive device DMA setups to be re-used. > > > > A 'shadow pte' model (ie hmm) *really* needs device support to > > directly block DMA access - ie trigger 'device page fault'. ie the > > invalidate_start should inform the device to enter a fault mode and > > that is it. If the device can't do that, then the driver probably > > shouldn't persue this level of coherency. The driver would quickly get > > into the messy locking problems like dma_fence_wait from a notifier. > > I think here we do not agree on the hardware requirement. For GPU > we will always need to be able to wait for some GPU fence from inside > the notifier callback, there is just no way around that for many of > the GPUs today (i do not see any indication of that changing). I didn't say you couldn't wait, I was trying to say that the wait should only be contingent on the HW itself. Ie you can wait on a GPU page table lock, and you can wait on a GPU page table flush completion via IRQ. What is troubling is to wait till some other thread gets a GPU command completion and decrements a kref on the DMA buffer - which kinda looks like what this dma_fence() stuff is all about. A driver like that would have to be super careful to ensure consistent forward progress toward dma ref == 0 when the system is under reclaim. ie by running its entire IRQ flow under fs_reclaim locking. > associated with the mm_struct. In all GPU driver so far it is a short > lived lock and nothing blocking is done while holding it (it is just > about updating page table directory really wether it is filling it or > clearing it). The main blocking I expect in a shadow PTE flow is waiting for the HW to complete invalidations of its PTE cache. > > It is important to identify what model you are going for as defining a > > 'registration cache' coherence expectation allows the driver to skip > > blocking in invalidate_range_start. All it does is invalidate the > > cache so that future operations pick up the new VA mapping. > > > > Intel's HFI RDMA driver uses this model extensively, and I think it is > > well proven, within some limitations of course. > > > > At least, 'registration cache' is the only use model I know of where > > it is acceptable to skip invalidate_range_end. > > Here GPU are not in the registration cache model, i know it might looks > like it because of GUP but GUP was use just because hmm did not exist > at the time. It is not because of GUP, it is because of the lack of invalidate_range_end. A driver cannot correctly implement the SPTE model without invalidate_range_end, even if it holds the page pins via GUP. So, I've been assuming the few drivers without invalidate_range_end are trying to do registration caching, rather than assuming they are broken. Jason
On Thu, Aug 15, 2019 at 01:11:56PM -0400, Jerome Glisse wrote: > On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > > __GFP_DIRECT_RECLAIM.. > > > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > > allocations during OOM, then I think fs_reclaim already matches what > > > > you described? > > > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > > the oom_reaper actually requires. In short no blocking on direct or > > > indirect dependecy on memory allocation that might sleep. > > > > It is much easier to follow with some hints on code, so the true > > requirement is that the OOM repear not block on GFP_FS and GFP_IO > > allocations, great, that constraint is now clear. > > > > > If you can express that in the existing lockdep machinery. All > > > fine. But then consider deployments where lockdep is no-no because > > > of the overhead. > > > > This is all for driver debugging. The point of lockdep is to find all > > these paths without have to hit them as actual races, using debug > > kernels. > > > > I don't think we need this kind of debugging on production kernels? > > > > > > The best we got was drivers tested the VA range and returned success > > > > if they had no interest. Which is a big win to be sure, but it looks > > > > like getting any more is not really posssible. > > > > > > And that is already a great win! Because many notifiers only do care > > > about particular mappings. Please note that backing off unconditioanlly > > > will simply cause that the oom reaper will have to back off not doing > > > any tear down anything. > > > > Well, I'm working to propose that we do the VA range test under core > > mmu notifier code that cannot block and then we simply remove the idea > > of blockable from drivers using this new 'range notifier'. > > > > I think this pretty much solves the concern? > > I am not sure i follow what you propose here ? Like i pointed out in > another email for GPU we do need to be able to sleep (we might get > lucky and not need too but this is runtime thing) within notifier > range_start callback. This has been something allow by notifier since > it has been introduced in the kernel. Sorry, I mean remove the idea of the blockable flag from the drivers. Drivers will always be able to block, within the existing limitation of fs_reclaim Jason
On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote: > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > > > > > You have to wait for the gpu to finnish current processing in > > > > invalidate_range_start. Otherwise there's no point to any of this > > > > really. So the wait_event/dma_fence_wait are unavoidable really. > > > > > > I don't envy your task :| > > > > > > But, what you describe sure sounds like a 'registration cache' model, > > > not the 'shadow pte' model of coherency. > > > > > > The key difference is that a regirstationcache is allowed to become > > > incoherent with the VMA's because it holds page pins. It is a > > > programming bug in userspace to change VA mappings via mmap/munmap/etc > > > while the device is working on that VA, but it does not harm system > > > integrity because of the page pin. > > > > > > The cache ensures that each initiated operation sees a DMA setup that > > > matches the current VA map when the operation is initiated and allows > > > expensive device DMA setups to be re-used. > > > > > > A 'shadow pte' model (ie hmm) *really* needs device support to > > > directly block DMA access - ie trigger 'device page fault'. ie the > > > invalidate_start should inform the device to enter a fault mode and > > > that is it. If the device can't do that, then the driver probably > > > shouldn't persue this level of coherency. The driver would quickly get > > > into the messy locking problems like dma_fence_wait from a notifier. > > > > I think here we do not agree on the hardware requirement. For GPU > > we will always need to be able to wait for some GPU fence from inside > > the notifier callback, there is just no way around that for many of > > the GPUs today (i do not see any indication of that changing). > > I didn't say you couldn't wait, I was trying to say that the wait > should only be contigent on the HW itself. Ie you can wait on a GPU > page table lock, and you can wait on a GPU page table flush completion > via IRQ. > > What is troubling is to wait till some other thread gets a GPU command > completion and decr's a kref on the DMA buffer - which kinda looks > like what this dma_fence() stuff is all about. A driver like that > would have to be super careful to ensure consistent forward progress > toward dma ref == 0 when the system is under reclaim. > > ie by running it's entire IRQ flow under fs_reclaim locking. This is correct. At least for i915 it's already required due to our shrinker also having to do the same. I think amdgpu isn't bothering with that since they have vram for most of the stuff, and just limit system memory usage to half of all system memory and forgo the shrinker. Probably not the nicest approach. Anyway, both do the same mmu_notifier dance, just want to explain that we've been living with this for longer already. So yeah writing a gpu driver is not easy. > > associated with the mm_struct. In all GPU driver so far it is a short > > lived lock and nothing blocking is done while holding it (it is just > > about updating page table directory really wether it is filling it or > > clearing it). > > The main blocking I expect in a shadow PTE flow is waiting for the HW > to complete invalidations of its PTE cache. 
> > > > It is important to identify what model you are going for as defining a > > > 'registration cache' coherence expectation allows the driver to skip > > > blocking in invalidate_range_start. All it does is invalidate the > > > cache so that future operations pick up the new VA mapping. > > > > > > Intel's HFI RDMA driver uses this model extensively, and I think it is > > > well proven, within some limitations of course. > > > > > > At least, 'registration cache' is the only use model I know of where > > > it is acceptable to skip invalidate_range_end. > > > > Here GPU are not in the registration cache model, i know it might looks > > like it because of GUP but GUP was use just because hmm did not exist > > at the time. > > It is not because of GUP, it is because of the lack of > invalidate_range_end. A driver cannot correctly implement the SPTE > model without invalidate_range_end, even if it holds the page pins via > GUP. > > So, I've been assuming the few drivers without invalidate_range_end > are trying to do registration caching, rather than assuming they are > broken. I915 might just be broken. amdgpu does the full thing, using hmm_mirror. But still with dma_fence_wait. -Daniel
On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote: > On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote: > > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote: > > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > > > > > > > You have to wait for the gpu to finnish current processing in > > > > > invalidate_range_start. Otherwise there's no point to any of this > > > > > really. So the wait_event/dma_fence_wait are unavoidable really. > > > > > > > > I don't envy your task :| > > > > > > > > But, what you describe sure sounds like a 'registration cache' model, > > > > not the 'shadow pte' model of coherency. > > > > > > > > The key difference is that a regirstationcache is allowed to become > > > > incoherent with the VMA's because it holds page pins. It is a > > > > programming bug in userspace to change VA mappings via mmap/munmap/etc > > > > while the device is working on that VA, but it does not harm system > > > > integrity because of the page pin. > > > > > > > > The cache ensures that each initiated operation sees a DMA setup that > > > > matches the current VA map when the operation is initiated and allows > > > > expensive device DMA setups to be re-used. > > > > > > > > A 'shadow pte' model (ie hmm) *really* needs device support to > > > > directly block DMA access - ie trigger 'device page fault'. ie the > > > > invalidate_start should inform the device to enter a fault mode and > > > > that is it. If the device can't do that, then the driver probably > > > > shouldn't persue this level of coherency. The driver would quickly get > > > > into the messy locking problems like dma_fence_wait from a notifier. > > > > > > I think here we do not agree on the hardware requirement. For GPU > > > we will always need to be able to wait for some GPU fence from inside > > > the notifier callback, there is just no way around that for many of > > > the GPUs today (i do not see any indication of that changing). > > > > I didn't say you couldn't wait, I was trying to say that the wait > > should only be contigent on the HW itself. Ie you can wait on a GPU > > page table lock, and you can wait on a GPU page table flush completion > > via IRQ. > > > > What is troubling is to wait till some other thread gets a GPU command > > completion and decr's a kref on the DMA buffer - which kinda looks > > like what this dma_fence() stuff is all about. A driver like that > > would have to be super careful to ensure consistent forward progress > > toward dma ref == 0 when the system is under reclaim. > > > > ie by running it's entire IRQ flow under fs_reclaim locking. > > This is correct. At least for i915 it's already a required due to our > shrinker also having to do the same. I think amdgpu isn't bothering > with that since they have vram for most of the stuff, and just limit > system memory usage to half of all and forgo the shrinker. Probably > not the nicest approach. Anyway, both do the same mmu_notifier dance, > just want to explain that we've been living with this for longer > already. > > So yeah writing a gpu driver is not easy. > > > > associated with the mm_struct. In all GPU driver so far it is a short > > > lived lock and nothing blocking is done while holding it (it is just > > > about updating page table directory really wether it is filling it or > > > clearing it). 
> > The main blocking I expect in a shadow PTE flow is waiting for the HW > > to complete invalidations of its PTE cache. > > > > > > It is important to identify what model you are going for as defining a > > > > 'registration cache' coherence expectation allows the driver to skip > > > > blocking in invalidate_range_start. All it does is invalidate the > > > > cache so that future operations pick up the new VA mapping. > > > > > > > > Intel's HFI RDMA driver uses this model extensively, and I think it is > > > > well proven, within some limitations of course. > > > > > > > > At least, 'registration cache' is the only use model I know of where > > > > it is acceptable to skip invalidate_range_end. > > > > > > Here GPU are not in the registration cache model, i know it might looks > > > like it because of GUP but GUP was use just because hmm did not exist > > > at the time. > > > > It is not because of GUP, it is because of the lack of > > invalidate_range_end. A driver cannot correctly implement the SPTE > > model without invalidate_range_end, even if it holds the page pins via > > GUP. > > > > So, I've been assuming the few drivers without invalidate_range_end > > are trying to do registration caching, rather than assuming they are > > broken. > > I915 might just be broken. amdgpu does the full thing, using > hmm_mirror. But still with dma_fence_wait. Yeah i915 is broken but it never hurt anyone ;) I posted a patch a long time ago to convert it to hmm but I delayed that until I can get through making something of GUPfast that can also be used by HMM/ODP users. Cheers, Jérôme
On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote: > I'm not really well versed in the details of our userptr, but both > amdgpu and i915 wait for the gpu to complete from > invalidate_range_start. Jerome has at least looked a lot at the amdgpu > one, so maybe he can explain what exactly it is we're doing ... amdgpu is (wrongly) using hmm for something, I can't really tell what it is trying to do. The calls to dma_fence under the invalidate_range_start do not give me a good feeling. However, i915 shows all the signs of trying to follow the registration cache model, it even has a nice comment in i915_gem_userptr_get_pages() explaining that the races it has don't matter because it is a user space bug to change the VA mapping in the first place. That just screams registration cache to me. So it is fine to run HW that way, but if you do, there is no reason to fence inside the invalidate_range_end. Just orphan the DMA buffer and clean it up & release the page pins when all DMA buffer refs go to zero. The next access to that VA should get a new DMA buffer with the right mapping. In other words the invalidation should be very simple without complicated locking, or wait_event's. Look at hfi1 for example. Jason
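For what it's worth, the "orphan and clean up at ref == 0" flow described above could look roughly like the following. All the names here are made up, and the pattern is only loosely modeled on registration caches like hfi1's, not copied from any driver:

    #include <linux/kref.h>
    #include <linux/mm.h>
    #include <linux/slab.h>

    struct dma_setup {
            struct kref ref;        /* held by the cache and by in-flight DMA */
            struct page **pages;    /* pinned via GUP when the setup was built */
            unsigned long npages;
    };

    static void dma_setup_release(struct kref *ref)
    {
            struct dma_setup *s = container_of(ref, struct dma_setup, ref);
            unsigned long i;

            /* Last DMA reference is gone: only now do the pins go away. */
            for (i = 0; i < s->npages; i++)
                    put_page(s->pages[i]);
            kvfree(s->pages);
            kfree(s);
    }

    /* Called from invalidate_range_start(): no waiting, no fences. */
    static void dma_setup_orphan(struct dma_setup *s)
    {
            /*
             * Unlink from the VA lookup structure (not shown) so new work
             * builds a fresh setup against the new mapping, then drop the
             * lookup's reference. In-flight DMA still holds its own
             * reference, and with it the page pins, so nothing here has
             * to sleep waiting for the hardware.
             */
            kref_put(&s->ref, dma_setup_release);
    }

The design point is that the notifier path only ever drops a reference; the expensive teardown migrates to whatever context drops the last one.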
On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote: > > > I'm not really well versed in the details of our userptr, but both > > amdgpu and i915 wait for the gpu to complete from > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu > > one, so maybe he can explain what exactly it is we're doing ... > > amdgpu is (wrongly) using hmm for something, I can't really tell what > it is trying to do. The calls to dma_fence under the > invalidate_range_start do not give me a good feeling. > > However, i915 shows all the signs of trying to follow the registration > cache model, it even has a nice comment in > i915_gem_userptr_get_pages() explaining that the races it has don't > matter because it is a user space bug to change the VA mapping in the > first place. That just screams registration cache to me. > > So it is fine to run HW that way, but if you do, there is no reason to > fence inside the invalidate_range end. Just orphan the DMA buffer and > clean it up & release the page pins when all DMA buffer refs go to > zero. The next access to that VA should get a new DMA buffer with the > right mapping. > > In other words the invalidation should be very simple without > complicated locking, or wait_event's. Look at hfi1 for example. This would break today's usage model of userptr and it would break userspace expectations, ie if the GPU is writing to that memory then userspace wants to be sure that a later CPU read of the same virtual address sees what the GPU wrote. Yes, i915 is broken in that it has no end notifier and does not track active invalidations for a range, but the GUP side of things mostly hides this bug and shrinks the window for anything bad to happen to something so small that I doubt anyone could ever hit it (still a bug though). Cheers, Jérôme
On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > > __GFP_DIRECT_RECLAIM.. > > > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > > allocations during OOM, then I think fs_reclaim already matches what > > > > you described? > > > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > > the oom_reaper actually requires. In short no blocking on direct or > > > indirect dependecy on memory allocation that might sleep. > > > > It is much easier to follow with some hints on code, so the true > > requirement is that the OOM repear not block on GFP_FS and GFP_IO > > allocations, great, that constraint is now clear. I still do not get why you put FS/IO into the picture. This is really about __GFP_DIRECT_RECLAIM. > > > If you can express that in the existing lockdep machinery. All > > > fine. But then consider deployments where lockdep is no-no because > > > of the overhead. > > > > This is all for driver debugging. The point of lockdep is to find all > > these paths without have to hit them as actual races, using debug > > kernels. > > > > I don't think we need this kind of debugging on production kernels? Again, the primary motivation was a simple debugging aid that could be used without worrying about overhead. So lockdep is very often out of the question. > > > > The best we got was drivers tested the VA range and returned success > > > > if they had no interest. Which is a big win to be sure, but it looks > > > > like getting any more is not really posssible. > > > > > > And that is already a great win! Because many notifiers only do care > > > about particular mappings. Please note that backing off unconditioanlly > > > will simply cause that the oom reaper will have to back off not doing > > > any tear down anything. > > > > Well, I'm working to propose that we do the VA range test under core > > mmu notifier code that cannot block and then we simply remove the idea > > of blockable from drivers using this new 'range notifier'. > > > > I think this pretty much solves the concern? Well, my idea was that a range check and early bail out was a first step and then each specific notifier would be able to do a more specific check. I was not able to do the second step because that requires a deep understanding of the respective subsystem. Really all I do care about is to reclaim as much memory from the oom_reaper context as possible. And that cannot really be an unbounded process. Quite the contrary, it should be as swift as possible. From my cursory look some notifiers are able to achieve their task without blocking or depending on memory just fine. So bailing out unconditionally on the range of interest would just put us back. > > > However, we could (probably even should) make the drivers fs_reclaim > > > safe. > > > > > > If that is enough to guarantee progress of OOM, then lets consider > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS > > > allocation behavior on the driver callback and lockdep to try and keep > > > pushing on the the debugging, and dropping !blocking. > > > > How are you going to enforce indirect dependency? E.g. a lock that is > > also used in other context which depend on sleepable memory allocation > > to move forward. > > You mean like this:
>
>        CPU0                  CPU1
>                              mutex_lock()
>                              kmalloc(GFP_KERNEL)

no I mean __GFP_DIRECT_RECLAIM here.

>                              mutex_unlock()
>  fs_reclaim_acquire()
>  mutex_lock() <- illegal: lock dep assertion

I cannot really comment on how that is achievable by lockdep. I managed to forget details about FS/IO reclaim recursion tracking already and I do not have time to learn it again. It was quite a hack. Anyway, let me repeat that the primary motivation was a simple aid. Not something as powerful as lockdep.
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote: > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > > __GFP_DIRECT_RECLAIM.. > > > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > > allocations during OOM, then I think fs_reclaim already matches what > > > > you described? > > > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > > the oom_reaper actually requires. In short no blocking on direct or > > > indirect dependecy on memory allocation that might sleep. > > > > It is much easier to follow with some hints on code, so the true > > requirement is that the OOM repear not block on GFP_FS and GFP_IO > > allocations, great, that constraint is now clear. > > I still do not get why do you put FS/IO into the picture. This is really > about __GFP_DIRECT_RECLAIM. > > > > > > If you can express that in the existing lockdep machinery. All > > > fine. But then consider deployments where lockdep is no-no because > > > of the overhead. > > > > This is all for driver debugging. The point of lockdep is to find all > > these paths without have to hit them as actual races, using debug > > kernels. > > > > I don't think we need this kind of debugging on production kernels? > > Again, the primary motivation was a simple debugging aid that could be > used without worrying about overhead. So lockdep is very often out of > the question. > > > > > The best we got was drivers tested the VA range and returned success > > > > if they had no interest. Which is a big win to be sure, but it looks > > > > like getting any more is not really posssible. > > > > > > And that is already a great win! Because many notifiers only do care > > > about particular mappings. Please note that backing off unconditioanlly > > > will simply cause that the oom reaper will have to back off not doing > > > any tear down anything. > > > > Well, I'm working to propose that we do the VA range test under core > > mmu notifier code that cannot block and then we simply remove the idea > > of blockable from drivers using this new 'range notifier'. > > > > I think this pretty much solves the concern? > > Well, my idea was that a range check and early bail out was a first step > and then each specific notifier would be able to do a more specific > check. I was not able to do the second step because that requires a deep > understanding of the respective subsystem. > > Really all I do care about is to reclaim as much memory from the > oom_reaper context as possible. And that cannot really be an unbounded > process. Quite contrary it should be as swift as possible. From my > cursory look some notifiers are able to achieve their task without > blocking or depending on memory just fine. So bailing out > unconditionally on the range of interest would just put us back. Agreed, OOM is just asking the question: can I unmap that page quickly? So that it (OOM) can swap it out. In many cases the driver needs to look something up to see whether, at that moment, the memory is simply not in use and can be reclaimed right away. So drivers should have a path to optimistically update their state to allow quick reclaim. > > > > However, we could (probably even should) make the drivers fs_reclaim > > > > safe. 
> > > > > > > > If that is enough to guarantee progress of OOM, then lets consider > > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS > > > > allocation behavior on the driver callback and lockdep to try and keep > > > > pushing on the the debugging, and dropping !blocking. > > > > > > How are you going to enforce indirect dependency? E.g. a lock that is > > > also used in other context which depend on sleepable memory allocation > > > to move forward. > > > > You mean like this:
> >
> >        CPU0                  CPU1
> >                              mutex_lock()
> >                              kmalloc(GFP_KERNEL)
>
> no I mean __GFP_DIRECT_RECLAIM here.
>
> >                              mutex_unlock()
> >  fs_reclaim_acquire()
> >  mutex_lock() <- illegal: lock dep assertion
>
> I cannot really comment on how that is achieveable by lockdep. I managed > to forget details about FS/IO reclaim recursion tracking already and I > do not have time to learn it again. It was quite a hack. Anyway, let me > repeat that the primary motivation was a simple aid. Not something as > poverful as lockdep. I feel that fs_reclaim_acquire() is just too heavyweight here. I do think that Daniel's patches improve the debugging situation without burdening anything, so I am in favor of merging them. I do not think we should devote too much time to fs_reclaim(); our time would be better spent improving the driver shrinkers, for instance; after all, OOM is all about trying to free up memory. Cheers, Jérôme
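One possible shape for that "optimistic, quick reclaim" path, sketched with invented driver names (struct drv_mirror, drv_range_idle(), drv_wait_range_idle() and drv_clear_sptes() are all hypothetical); the only real kernel API used here is mmu_notifier_range_blockable() and the -EAGAIN convention for the non-blockable case:

    static int drv_invalidate_range_start(struct mmu_notifier *mn,
                                          const struct mmu_notifier_range *range)
    {
            struct drv_mirror *m = container_of(mn, struct drv_mirror, mn);

            if (!mmu_notifier_range_blockable(range)) {
                    /* OOM path: only proceed if nothing has to sleep. */
                    if (!mutex_trylock(&m->ptes_lock))
                            return -EAGAIN;
                    if (!drv_range_idle(m, range->start, range->end)) {
                            mutex_unlock(&m->ptes_lock);
                            return -EAGAIN; /* device busy; let OOM move on */
                    }
            } else {
                    mutex_lock(&m->ptes_lock);
                    drv_wait_range_idle(m, range->start, range->end); /* may sleep */
            }

            drv_clear_sptes(m, range->start, range->end);   /* non-blocking */
            mutex_unlock(&m->ptes_lock);
            return 0;
    }

The common case where the range is idle is then reclaimed swiftly even from the oom_reaper, and -EAGAIN is reserved for ranges the device is actively using.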
On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote: > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote: > > > > > I'm not really well versed in the details of our userptr, but both > > > amdgpu and i915 wait for the gpu to complete from > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu > > > one, so maybe he can explain what exactly it is we're doing ... > > > > amdgpu is (wrongly) using hmm for something, I can't really tell what > > it is trying to do. The calls to dma_fence under the > > invalidate_range_start do not give me a good feeling. > > > > However, i915 shows all the signs of trying to follow the registration > > cache model, it even has a nice comment in > > i915_gem_userptr_get_pages() explaining that the races it has don't > > matter because it is a user space bug to change the VA mapping in the > > first place. That just screams registration cache to me. > > > > So it is fine to run HW that way, but if you do, there is no reason to > > fence inside the invalidate_range end. Just orphan the DMA buffer and > > clean it up & release the page pins when all DMA buffer refs go to > > zero. The next access to that VA should get a new DMA buffer with the > > right mapping. > > > > In other words the invalidation should be very simple without > > complicated locking, or wait_event's. Look at hfi1 for example. > > This would break the today usage model of uptr and it will > break userspace expectation ie if GPU is writting to that > memory and that memory then the userspace want to make sure > that it will see what the GPU write. How exactly? This is holding the page pin, so the only way the VA mapping can be changed is via explicit user action. ie:

   gpu_write_something(va, size)
   mmap(.., va, size, MMAP_FIXED);
   gpu_wait_done()

This is racy and indeterminate with both models. Based on the comment in i915 it appears to be going on the model that changes to the mmap by userspace when the GPU is working on it is a programming bug. This is reasonable, lots of systems use this kind of consistency model. Since the driver seems to rely on a dma_fence to block DMA access, it looks to me like the kernel has full visibility to the 'gpu_write_something()' and 'gpu_wait_done()' markers. I think trying to use hmm_range_fault on HW that can't do HW page faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that is what amd gpu is trying to do. I haven't yet seen anything that looks like 'TLB shootdown' in i915?? Jason
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote: > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > > __GFP_DIRECT_RECLAIM.. > > > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > > allocations during OOM, then I think fs_reclaim already matches what > > > > you described? > > > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > > the oom_reaper actually requires. In short no blocking on direct or > > > indirect dependecy on memory allocation that might sleep. > > > > It is much easier to follow with some hints on code, so the true > > requirement is that the OOM repear not block on GFP_FS and GFP_IO > > allocations, great, that constraint is now clear. > > I still do not get why do you put FS/IO into the picture. This is really > about __GFP_DIRECT_RECLAIM. Like I said this is complicated, translating "no blocking on direct or indirect dependecy on memory allocation that might sleep" into GFP flags is hard for us outside the mm community. So the constraint here is no __GFP_DIRECT_RECLAIM? I bring up FS/IO because that is what Tejun mentioned when I asked him about reclaim restrictions, and is what fs_reclaim_acquire() is already sensitive to. It is pretty confusing if we have places using the word 'reclaim' with different restrictions. :( > > CPU0                  CPU1
> >                      mutex_lock()
> >                      kmalloc(GFP_KERNEL)
>
> no I mean __GFP_DIRECT_RECLAIM here.
>
> >                      mutex_unlock()
> >  fs_reclaim_acquire()
> >  mutex_lock() <- illegal: lock dep assertion
>
> I cannot really comment on how that is achieveable by lockdep. ??? I am trying to explain this is already done and working today. The above example will already fault with lockdep enabled. This is existing debugging we can use and improve upon rather than invent new debugging. Jason
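Spelled out as a (debug-only) self-contained snippet, the two-step sequence above is all it takes for lockdep to record the inverted ordering and splat, without the race ever actually happening. fs_reclaim_acquire()/fs_reclaim_release() are the real annotations from <linux/sched/mm.h> (under CONFIG_LOCKDEP); demo_lock and the two functions are of course illustrative:

    #include <linux/mutex.h>
    #include <linux/sched/mm.h>
    #include <linux/slab.h>

    static DEFINE_MUTEX(demo_lock);

    static void cpu1_path(void)
    {
            mutex_lock(&demo_lock);
            /* Allocator annotations record demo_lock -> fs_reclaim. */
            kfree(kmalloc(64, GFP_KERNEL));
            mutex_unlock(&demo_lock);
    }

    static void cpu0_path(void)
    {
            /* Enter the same lockdep state direct reclaim would be in. */
            fs_reclaim_acquire(GFP_KERNEL);
            mutex_lock(&demo_lock);   /* fs_reclaim -> demo_lock: splat */
            mutex_unlock(&demo_lock);
            fs_reclaim_release(GFP_KERNEL);
    }

Once cpu1_path() has run a single time, the first cpu0_path() call is reported as a possible deadlock, which is the "find the path on a debug kernel without hitting it as an actual race" property being argued for here.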
On Thu, Aug 15, 2019 at 03:01:59PM -0300, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote: > > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote: > > > > > > > I'm not really well versed in the details of our userptr, but both > > > > amdgpu and i915 wait for the gpu to complete from > > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu > > > > one, so maybe he can explain what exactly it is we're doing ... > > > > > > amdgpu is (wrongly) using hmm for something, I can't really tell what > > > it is trying to do. The calls to dma_fence under the > > > invalidate_range_start do not give me a good feeling. > > > > > > However, i915 shows all the signs of trying to follow the registration > > > cache model, it even has a nice comment in > > > i915_gem_userptr_get_pages() explaining that the races it has don't > > > matter because it is a user space bug to change the VA mapping in the > > > first place. That just screams registration cache to me. > > > > > > So it is fine to run HW that way, but if you do, there is no reason to > > > fence inside the invalidate_range end. Just orphan the DMA buffer and > > > clean it up & release the page pins when all DMA buffer refs go to > > > zero. The next access to that VA should get a new DMA buffer with the > > > right mapping. > > > > > > In other words the invalidation should be very simple without > > > complicated locking, or wait_event's. Look at hfi1 for example. > > > > This would break the today usage model of uptr and it will > > break userspace expectation ie if GPU is writting to that > > memory and that memory then the userspace want to make sure > > that it will see what the GPU write. > > How exactly? This is holding the page pin, so the only way the VA > mapping can be changed is via explicit user action. > > ie:
> >
> >    gpu_write_something(va, size)
> >    mmap(.., va, size, MMAP_FIXED);
> >    gpu_wait_done()
>
> This is racy and indeterminate with both models. > > Based on the comment in i915 it appears to be going on the model that > changes to the mmap by userspace when the GPU is working on it is a > programming bug. This is reasonable, lots of systems use this kind of > consistency model. Well, userspace processes doing munmap(), mremap(), fork() and things like that are a bug from the point of view of the i915 kernel/userspace contract. But things like migration or reclaim are not covered under that contract, and for those the expectation is that CPU access to the same virtual address gets whatever was last written to it, either by the GPU or the CPU. > > Since the driver seems to rely on a dma_fence to block DMA access, it > looks to me like the kernel has full visibility to the > 'gpu_write_something()' and 'gpu_wait_done()' markers. So let's only consider the case where the GPU wants to write to the memory (the read-only case is obviously fine and in fact needs no notifier), and like above the only thing we care about is reclaim or migration (for instance because of some numa compaction), as the rest is covered by the i915 userspace contract. So in the write case we do GUPfast(write=true), which will be "safe" from any concurrent CPU page table update, ie if GUPfast gets a reference for the page then any racing CPU page table update will not be able to migrate or reclaim the page, and thus the virtual address to page association will be preserved. 
This is only true because of GUPfast(); if GUPfast() fails it falls back to the slow GUP case, which makes the same thing safe by taking the page table lock. Because of the reference on the page the i915 driver can forego the mmu notifier end callback. The thing here is that taking a page reference is pointless if we have better synchronization and tracking of mmu notifiers. Hence converting to hmm mirror allows us to avoid taking a ref on the page while still keeping the same functionality as today. > I think trying to use hmm_range_fault on HW that can't do HW page > faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that > is what amd gpu is trying to do. > > I haven't yet seen anything that looks like 'TLB shootdown' in i915?? GPU drivers have complex usage patterns, and the TLB shootdown is implicit: once the GEM object associated with the userptr is invalidated, the next time userspace submits a command against that GEM object it will have to be re-validated, which means re-programming the GPU page table to point to the proper addresses (and re-calling GUP). So the whole GPU page table update is hidden behind GEM functions like i915_gem_object_unbind() (or ttm invalidate for amd/radeon). Technically those GPUs do not support page faulting, but because of the way GPUs work you have frequent checkpoints where you can unbind things. This is what is happening here. Also, not all GPU engines can handle page faults; this is true of all GPUs with page-fault support AFAIK (AMD and NVidia so far). This is why userptr is a limited form of SVM (shared virtual memory) that can be used on all GPU engines, including the DMA engine. When using the full SVM power (like hmm mirror with nouveau) it is only used on GPU engines that support page faults (but updating the page table still requires locking and waiting on an acknowledgement). Cheers, Jérôme
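The pinning step Jérôme describes, as a minimal sketch against the get_user_pages_fast() API of this era; userptr_pin() is a hypothetical wrapper and error handling is trimmed:

    #include <linux/mm.h>

    /* Pin a user VA range for write so reclaim/migration cannot move it. */
    static int userptr_pin(unsigned long start, int nr_pages,
                           struct page **pages)
    {
            int pinned;

            /*
             * get_user_pages_fast() takes a reference on each page; a
             * racing CPU page-table update can then no longer reclaim or
             * migrate them, so the VA -> page association stays stable.
             * If the lockless fast path fails, it falls back internally
             * to slow GUP, which gives the same guarantee under the page
             * table lock.
             */
            pinned = get_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
            return pinned;  /* may be < nr_pages on partial pin */
    }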
On Thu, Aug 15, 2019 at 02:27:19PM -0400, Jerome Glisse wrote: > > How exactly? This is holding the page pin, so the only way the VA > > mapping can be changed is via explicit user action. > > > > ie: > > > > gpu_write_something(va, size) > > mmap(.., va, size, MMAP_FIXED); > > gpu_wait_done() > > > > This is racy and indeterminate with both models. > > > > Based on the comment in i915 it appears to be going on the model that > > changes to the mmap by userspace when the GPU is working on it is a > > programming bug. This is reasonable, lots of systems use this kind of > > consistency model. > > Well userspace process doing munmap(), mremap(), fork() and things like > that are a bug from the i915 kernel and userspace contract point of view. > > But things like migration or reclaim are not cover under that contract > and for those the expectation is that CPU access to the same virtual address > should allow to get what was last written to it either by the GPU or the > CPU. Okay, this is a more reasonable point - I agree the i915 registration cache model precludes using migration and thus DEVICE_PRIVATE. This is a strong motivation to use the hmm approach. But we started out this conversation asking if i915 is correct, and I still say a registration cache model is a functionally correct way to use notifiers. > Because of the reference on the page the i915 driver can forego the mmu > notifier end callback. The thing here is that taking a page reference > is pointless if we have better synchronization and tracking of mmu > notifier. Hence converting to hmm mirror allows to avoid taking a ref > on the page while still keeping the same functionality as of today. However, there is a huge trade-off here. Drivers like this are going to have very complicated locking inside invalidate_range_start as they must sleep waiting for dma buffer references to go to zero. > GPU driver have complex usage pattern the tlb shootdown is implicit > once the GEM object associated with the uptr is invalidated it means > next time userspace submit command against that GEM object it will > have to re-validate it which means re-program the GPU page table to > point to the proper address (and re-call GUP). I think it is a mistake to try and cram the very different approaches to notifiers into the same API. SW ref counting of DMA buffers vs HW async page faulting have totally different requirements and locking schemes. This explains why AMDGPU gets away with not using the hmm API properly - it is probably relying on its DMA refcount, not the hmm valid, to keep things in order? I think the API approach in hmm_mirror is reasonable for page faulting HW, but does not serve refcounting HW well at all. Jason
On Thu 15-08-19 15:24:48, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote: > > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > > > __GFP_DIRECT_RECLAIM.. > > > > > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > > > allocations during OOM, then I think fs_reclaim already matches what > > > > > you described? > > > > > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > > > the oom_reaper actually requires. In short no blocking on direct or > > > > indirect dependecy on memory allocation that might sleep. > > > > > > It is much easier to follow with some hints on code, so the true > > > requirement is that the OOM repear not block on GFP_FS and GFP_IO > > > allocations, great, that constraint is now clear. > > > > I still do not get why do you put FS/IO into the picture. This is really > > about __GFP_DIRECT_RECLAIM. > > Like I said this is complicated, translating "no blocking on direct or > indirect dependecy on memory allocation that might sleep" into GFP > flags is hard for us outside the mm community. > > So the contraint here is no __GFP_DIRECT_RECLAIM? OK, I am obviously failing to explain that. Sorry about that. You are right that this is not simple. Let me try again. The context we are calling !blockable notifiers from has to finish in a _finite_ amount of time (and swift is hugely appreciated by users of otherwise non-responsive system that is under OOM). We are out of memory so we cannot be blocked waiting for memory. Directly or indirectly (via a lock, waiting for an event that needs memory to finish in general). So you need to track dependency over more complicated contexts than the direct call path (think of workqueue for example). > I bring up FS/IO because that is what Tejun mentioned when I asked him > about reclaim restrictions, and is what fs_reclaim_acquire() is > already sensitive too. It is pretty confusing if we have places using > the word 'reclaim' with different restrictions. :( fs_reclaim has been invented to catch potential deadlocks when a GFP_NO{FS/IO} allocation hits into fs/io reclaim. This is a subset of the reclaim that we have. The oom context is more restricted and it cannot depend on _any_ memory reclaim because by the time we have got to this context all the reclaim has already failed and wait some more will simply not help. > > > CPU0 CPU1 > > > mutex_lock() > > > kmalloc(GFP_KERNEL) > > > > no I mean __GFP_DIRECT_RECLAIM here. > > > > > mutex_unlock() > > > fs_reclaim_acquire() > > > mutex_lock() <- illegal: lock dep assertion > > > > I cannot really comment on how that is achieveable by lockdep. > > ??? I am trying to explain this is already done and working today. The > above example will already fault with lockdep enabled. > > This is existing debugging we can use and improve upon rather that > invent new debugging. This is what you claim and I am saying that fs_reclaim is about a restricted reclaim context and it is an ugly hack. It has proven to report false positives. Maybe it can be extended to a generic reclaim. I haven't tried that. Do not aim to try it.
On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote: > This is what you claim and I am saying that fs_reclaim is about a > restricted reclaim context and it is an ugly hack. It has proven to > report false positives. Maybe it can be extended to a generic reclaim. > I haven't tried that. Do not aim to try it. Okay, great, I think this has been very helpful, at least for me, thanks. I did not know fs_reclaim was so problematic, or the special cases about OOM 'reclaim'. On this patch, I have no general objection to enforcing drivers to be non-blocking, I'd just like to see it done with the existing lockdep can't-sleep detection rather than inventing some new debugging for it. I understand this means the debugging requires lockdep enabled and will not run in production, but I'm of the view that is OK and in line with general kernel practice. The last detail is I'm still unclear what GFP flags a blockable invalidate_range_start() should use. Is GFP_KERNEL OK? Lockdep has complained on that in the past due to fs_reclaim - how do you know if it is a false positive? Thanks again, Jason
On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote: > > > This is what you claim and I am saying that fs_reclaim is about a > > restricted reclaim context and it is an ugly hack. It has proven to > > report false positives. Maybe it can be extended to a generic reclaim. > > I haven't tried that. Do not aim to try it. > > Okay, great, I think this has been very helpful, at least for me, > thanks. I did not know fs_reclaim was so problematic, or the special > cases about OOM 'reclaim'. I am happy that this is more clear now. > On this patch, I have no general objection to enforcing drivers to be > non-blocking, I'd just like to see it done with the existing lockdep > can't sleep detection rather than inventing some new debugging for it. > > I understand this means the debugging requires lockdep enabled and > will not run in production, but I'm of the view that is OK and in line > with general kernel practice. Yes and I do agree with this in general. > The last detail is I'm still unclear what a GFP flags a blockable > invalidate_range_start() should use. Is GFP_KERNEL OK? I hope I will not make this muddy again ;) invalidate_range_start in the blockable mode can use/depend on any sleepable allocation allowed in the context it is called from. So in other words it is no different from any other function in the kernel that calls into allocator. As the API is missing gfp context then I hope it is not called from any restricted contexts (except from the oom which we have !blockable for). > Lockdep has > complained on that in past due to fs_reclaim - how do you know if it > is a false positive? I would have to see the specific lockdep splat.
On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote: > > The last detail is I'm still unclear what a GFP flags a blockable > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > I hope I will not make this muddy again ;) > invalidate_range_start in the blockable mode can use/depend on any sleepable > allocation allowed in the context it is called from. 'in the context it is called from' is the magic phrase, as invalidate_range_start is called while holding several different mm related locks. I know at least write mmap_sem and i_mmap_rwsem (write?) Can GFP_KERNEL be called while holding those locks? This is the question of indirect dependency on reclaim via locks you raised earlier. > So in other words it is no different from any other function in the > kernel that calls into allocator. As the API is missing gfp context > then I hope it is not called from any restricted contexts (except > from the oom which we have !blockable for). Yes, the callers are exactly my concern. > > Lockdep has > > complained on that in past due to fs_reclaim - how do you know if it > > is a false positive? > > I would have to see the specific lockdep splat. See below. I found it when trying to understand why the registration of the mmu notifier was so oddly coded. The situation was:

   down_write(&mm->mmap_sem);
   mm_take_all_locks(mm);
   kmalloc(GFP_KERNEL);  <--- lockdep warning

I understood Daniel said he saw this directly on a recent kernel when working with his lockdep patch? Checking myself, on today's kernel I see a call chain:

   shrink_all_memory
     fs_reclaim_acquire(sc.gfp_mask);
     [..]
     do_try_to_free_pages
      shrink_zones
       shrink_node
        shrink_node_memcg
         shrink_list
          shrink_active_list
           page_referenced
            rmap_walk
             rmap_walk_file
              i_mmap_lock_read
               down_read(i_mmap_rwsem)

So it is possible that the down_read() above will block on i_mmap_rwsem being held in the caller of invalidate_range_start which is doing kmalloc(GFP_KERNEL). Is this OK? The lockdep annotation says no.. Jason

commit 35cfa2b0b491c37e23527822bf365610dbb188e5
Author: Gavin Shan <shangw@linux.vnet.ibm.com>
Date:   Thu Oct 25 13:38:01 2012 -0700

    mm/mmu_notifier: allocate mmu_notifier in advance

    While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
    to work in case of tight available memory. Eventually, that would lead
    to a deadlock while the swap deamon swaps anonymous pages. It was caused
    by commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").

    =================================
    [ INFO: inconsistent lock state ]
    3.7.0-rc1+ #518 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
     (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
    {RECLAIM_FS-ON-W} state was registered at:
       mark_held_locks+0x86/0x150
       lockdep_trace_alloc+0x67/0xc0
       kmem_cache_alloc_trace+0x33/0x230
       do_mmu_notifier_register+0x87/0x180
       mmu_notifier_register+0x13/0x20
       kvm_dev_ioctl+0x428/0x510
       do_vfs_ioctl+0x98/0x570
       sys_ioctl+0x91/0xb0
       system_call_fastpath+0x16/0x1b
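The fix in that commit is the standard pattern for this situation: do the sleeping allocation before taking the reclaim-visible locks. Roughly, and heavily simplified from the actual do_mmu_notifier_register() flow (the publish step is elided):

    static int do_mmu_notifier_register(struct mmu_notifier *mn,
                                        struct mm_struct *mm)
    {
            struct mmu_notifier_mm *mmu_notifier_mm;
            int ret;

            /* GFP_KERNEL is fine here: no mm locks are held yet. */
            mmu_notifier_mm = kmalloc(sizeof(*mmu_notifier_mm), GFP_KERNEL);
            if (!mmu_notifier_mm)
                    return -ENOMEM;

            down_write(&mm->mmap_sem);
            ret = mm_take_all_locks(mm);    /* takes i_mmap_rwsem et al. */
            if (ret)
                    goto out_unlock;

            if (!mm_has_notifiers(mm)) {
                    mm->mmu_notifier_mm = mmu_notifier_mm;
                    mmu_notifier_mm = NULL; /* ownership transferred */
            }
            /* ... link mn into the list ... */

            mm_drop_all_locks(mm);
    out_unlock:
            up_write(&mm->mmap_sem);
            kfree(mmu_notifier_mm);         /* no-op if it was published */
            return ret;
    }

Which is exactly why lockdep treats an allocation under those locks as suspect: the only reason this registration path survives is that someone noticed and moved the kmalloc() out from under them.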
On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote: > > > > > This is what you claim and I am saying that fs_reclaim is about a > > > restricted reclaim context and it is an ugly hack. It has proven to > > > report false positives. Maybe it can be extended to a generic reclaim. > > > I haven't tried that. Do not aim to try it. > > > > Okay, great, I think this has been very helpful, at least for me, > > thanks. I did not know fs_reclaim was so problematic, or the special > > cases about OOM 'reclaim'. > > I am happy that this is more clear now. > > > On this patch, I have no general objection to enforcing drivers to be > > non-blocking, I'd just like to see it done with the existing lockdep > > can't sleep detection rather than inventing some new debugging for it. > > > > I understand this means the debugging requires lockdep enabled and > > will not run in production, but I'm of the view that is OK and in line > > with general kernel practice. > > Yes and I do agree with this in general. So if someone can explain to me how that works with lockdep I can of course implement it. But afaics that doesn't exist (I tried to explain that somewhere else already), and I'm not really looking forward to hacking also on lockdep for this little series. > > The last detail is I'm still unclear what a GFP flags a blockable > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > I hope I will not make this muddy again ;) > invalidate_range_start in the blockable mode can use/depend on any sleepable > allocation allowed in the context it is called from. So in other words > it is no different from any other function in the kernel that calls into > allocator. As the API is missing gfp context then I hope it is not > called from any restricted contexts (except from the oom which we have > !blockable for). Hm, that's new to me. I thought mmu notifiers very much can be called from direct reclaim paths, so you have to be extremely careful with getting back into that one. At least the lockdep splats I remember also tend to have fs_reclaim in there, that's where all the fun comes from. > > Lockdep has > > complained on that in past due to fs_reclaim - how do you know if it > > is a false positive? > > I would have to see the specific lockdep splat. I guess the lockdep annotation for invalidate_range_start carries the same risks as the fs_reclaim annotation. Still feels like it's worth it. -Daniel
On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > So if someone can explain to me how that works with lockdep I can of > course implement it. But afaics that doesn't exist (I tried to explain > that somewhere else already), and I'm no really looking forward to > hacking also on lockdep for this little series. Hmm, kind of looks like it is done by calling preempt_disable() Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep? Jason
On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > > So if someone can explain to me how that works with lockdep I can of > > course implement it. But afaics that doesn't exist (I tried to explain > > that somewhere else already), and I'm no really looking forward to > > hacking also on lockdep for this little series. > > Hmm, kind of looks like it is done by calling preempt_disable() Yup. That was v1, then came the suggestion that disabling preemption is maybe not the best thing (the oom reaper could still run for a long time comparatively, if it's cleaning out gigabytes of process memory or what not, hence this dedicated debug infrastructure). > Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep? CONFIG_DEBUG_ATOMIC_SLEEP. Like in my patch. -Daniel
On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > I continue to struggle with this. It introduces a new kernel state > > "running preemptibly but must not call schedule()". How does this make > > any sense? > > > > Perhaps a much, much more detailed description of the oom_reaper > > situation would help out. > > The primary point here is that there is a demand of non blockable mmu > notifiers to be called when the oom reaper tears down the address space. > As the oom reaper is the primary guarantee of the oom handling forward > progress it cannot be blocked on anything that might depend on blockable > memory allocations. These are not really easy to track because they > might be indirect - e.g. notifier blocks on a lock which other context > holds while allocating memory or waiting for a flusher that needs memory > to perform its work. If such a blocking state happens that we can end up > in a silent hang with an unusable machine. > Now we hope for reasonable implementations of mmu notifiers (strong > words I know ;) and this should be relatively simple and effective catch > all tool to detect something suspicious is going on. > > Does that make the situation more clear? Yes, thanks, much. Maybe a code comment along the lines of This is on behalf of the oom reaper, specifically when it is calling the mmu notifiers. The problem is that if the notifier were to block on, for example, mutex_lock() and if the process which holds that mutex were to perform a sleeping memory allocation, the oom reaper is now blocked on completion of that memory allocation. btw, do we need task_struct.non_block_count? Perhaps the oom reaper thread could set a new PF_NONBLOCK (which would be more general than PF_OOM_REAPER). If we run out of PF_ flags, use (current == oom_reaper_th).
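For readers joining mid-series: the annotation under discussion is roughly the following. This is a simplified paraphrase of the idea (a per-task counter consulted by the might_sleep() machinery), not the exact hunks of the patch:

    /* Simplified sketch of the proposed annotation: */

    static inline void non_block_start(void)
    {
            if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP))
                    current->non_block_count++;
    }

    static inline void non_block_end(void)
    {
            if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP))
                    WARN_ON(current->non_block_count-- == 0);
    }

    /*
     * ___might_sleep() then treats non_block_count != 0 like an atomic
     * section and prints the usual "BUG: sleeping function called from
     * invalid context" warning, even though the region stays fully
     * preemptible - which is the new kernel state Andrew is asking about.
     */

The oom_reaper brackets its notifier calls with non_block_start()/non_block_end(), and any notifier that tries to sleep inside that window gets flagged on debug kernels.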
On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote: > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > > > So if someone can explain to me how that works with lockdep I can of > > > course implement it. But afaics that doesn't exist (I tried to explain > > > that somewhere else already), and I'm no really looking forward to > > > hacking also on lockdep for this little series. > > > > Hmm, kind of looks like it is done by calling preempt_disable() > > Yup. That was v1, then came the suggestion that disabling preemption > is maybe not the best thing (the oom reaper could still run for a long > time comparatively, if it's cleaning out gigabytes of process memory > or what not, hence this dedicated debug infrastructure). Oh, I'm coming in late, sorry. Anyhow, I was thinking since we agreed this can trigger on some CONFIG_DEBUG flag, something like

   /* This is a sleepable region, but use preempt_disable to get debugging
    * for calls that are not allowed to block for OOM [.. insert
    * Michal's explanation.. ] */
   if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
        preempt_disable();
   ops->invalidate_range_start();

And I have also been idly mulling doing something like

   if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) &&
       rand &&
       mmu_notifier_range_blockable(range)) {
        range->flags = 0
        if (!ops->invalidate_range_start(range))
             continue

        // Failed, try again as blockable
        range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
   }
   ops->invalidate_range_start(range);

Which would give coverage for this corner case without forcing OOM. Jason
On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote: > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > > > > So if someone can explain to me how that works with lockdep I can of > > > > course implement it. But afaics that doesn't exist (I tried to explain > > > > that somewhere else already), and I'm no really looking forward to > > > > hacking also on lockdep for this little series. > > > > > > Hmm, kind of looks like it is done by calling preempt_disable() > > > > Yup. That was v1, then came the suggestion that disabling preemption > > is maybe not the best thing (the oom reaper could still run for a long > > time comparatively, if it's cleaning out gigabytes of process memory > > or what not, hence this dedicated debug infrastructure). > > Oh, I'm coming in late, sorry > > Anyhow, I was thinking since we agreed this can trigger on some > CONFIG_DEBUG flag, something like
>
>    /* This is a sleepable region, but use preempt_disable to get debugging
>     * for calls that are not allowed to block for OOM [.. insert
>     * Michal's explanation.. ] */
>    if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
>         preempt_disable();
>    ops->invalidate_range_start();

I think we also discussed that, and some expressed concerns it would change behaviour/timing too much for testing. Since this does disable preemption for real, not just for might_sleep.

> And I have also been idly mulling doing something like
>
>    if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) &&
>        rand &&
>        mmu_notifier_range_blockable(range)) {
>         range->flags = 0
>         if (!ops->invalidate_range_start(range))
>              continue
>
>         // Failed, try again as blockable
>         range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
>    }
>    ops->invalidate_range_start(range);
>
> Which would give coverage for this corner case without forcing OOM.

Hm, this sounds like a neat idea to slap on top. The rand is going to be a bit tricky though, but I guess for this we could stuff another counter into task_struct and just e.g. do this every 1000th or so invalidate (we'll need to pick a prime so we cycle through notifiers in case there's multiple). I like. Michal, thoughts? -Daniel
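A sketch of what that counter could look like; current->mmn_trial_count is an invented field, and CONFIG_DEBUG_NOTIFIERS is reused here only because Jason's sketch above used it:

    /*
     * Hypothetical: fire the non-blockable trial on every 1009th
     * blockable invalidation, counted per task with a prime stride so
     * multiple notifiers on the same mm don't stay in lockstep.
     */
    #define MMN_TRIAL_STRIDE 1009

    static bool mmn_trial_nonblockable(void)
    {
            if (!IS_ENABLED(CONFIG_DEBUG_NOTIFIERS))
                    return false;
            return (++current->mmn_trial_count % MMN_TRIAL_STRIDE) == 0;
    }

This would replace the "rand" in the earlier sketch: first attempt the callback with the blockable flag cleared, and fall back to the normal blockable call when the driver returns -EAGAIN.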
On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote: > > > > The last detail is I'm still unclear what a GFP flags a blockable > > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > > > I hope I will not make this muddy again ;) > > invalidate_range_start in the blockable mode can use/depend on any sleepable > > allocation allowed in the context it is called from. > > 'in the context is is called from' is the magic phrase, as > invalidate_range_start is called while holding several different mm > related locks. I know at least write mmap_sem and i_mmap_rwsem > (write?) > > Can GFP_KERNEL be called while holding those locks? i_mmap_rwsem would be problematic because it is taken during reclaim. > This is the question of indirect dependency on reclaim via locks you > raised earlier. > > > So in other words it is no different from any other function in the > > kernel that calls into allocator. As the API is missing gfp context > > then I hope it is not called from any restricted contexts (except > > from the oom which we have !blockable for). > > Yes, the callers are exactly my concern. > > > > Lockdep has > > > complained on that in past due to fs_reclaim - how do you know if it > > > is a false positive? > > > > I would have to see the specific lockdep splat. > > See below. I found it when trying to understand why the registration > of the mmu notififer was so oddly coded. > > The situation was:
>
>    down_write(&mm->mmap_sem);
>    mm_take_all_locks(mm);
>    kmalloc(GFP_KERNEL);  <--- lockdep warning

Ugh. mm_take_all_locks :/

> I understood Daniel said he saw this directly on a recent kernel when > working with his lockdep patch? > > Checking myself, on todays kernel I see a call chain:
>
>    shrink_all_memory
>      fs_reclaim_acquire(sc.gfp_mask);
>      [..]
>      do_try_to_free_pages
>       shrink_zones
>        shrink_node
>         shrink_node_memcg
>          shrink_list
>           shrink_active_list
>            page_referenced
>             rmap_walk
>              rmap_walk_file
>               i_mmap_lock_read
>                down_read(i_mmap_rwsem)
>
> So it is possible that the down_read() above will block on > i_mmap_rwsem being held in the caller of invalidate_range_start which > is doing kmalloc(GPF_KERNEL). > > Is this OK? The lockdep annotation says no.. It's not, as per the above code path, which is easily possible because mm_take_all_locks will lock all file vmas.
On Thu 15-08-19 15:15:09, Andrew Morton wrote: > On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > > > I continue to struggle with this. It introduces a new kernel state > > > "running preemptibly but must not call schedule()". How does this make > > > any sense? > > > > > > Perhaps a much, much more detailed description of the oom_reaper > > > situation would help out. > > > > The primary point here is that there is a demand of non blockable mmu > > notifiers to be called when the oom reaper tears down the address space. > > As the oom reaper is the primary guarantee of the oom handling forward > > progress it cannot be blocked on anything that might depend on blockable > > memory allocations. These are not really easy to track because they > > might be indirect - e.g. notifier blocks on a lock which other context > > holds while allocating memory or waiting for a flusher that needs memory > > to perform its work. If such a blocking state happens that we can end up > > in a silent hang with an unusable machine. > > Now we hope for reasonable implementations of mmu notifiers (strong > > words I know ;) and this should be relatively simple and effective catch > > all tool to detect something suspicious is going on. > > > > Does that make the situation more clear? > > Yes, thanks, much. Maybe a code comment along the lines of > > This is on behalf of the oom reaper, specifically when it is > calling the mmu notifiers. The problem is that if the notifier were > to block on, for example, mutex_lock() and if the process which holds > that mutex were to perform a sleeping memory allocation, the oom > reaper is now blocked on completion of that memory allocation. reaper is now blocked on completion of that memory allocation and cannot provide the guarantee of the OOM forward progress. OK. > btw, do we need task_struct.non_block_count? Perhaps the oom reaper > thread could set a new PF_NONBLOCK (which would be more general than > PF_OOM_REAPER). If we run out of PF_ flags, use (current == oom_reaper_th). Well, I do not have a strong opinion here. A simple check for the value seems to be trivial. There are quite some holes in task_struct to hide this counter without increasing the size.
On Thu 15-08-19 22:16:43, Daniel Vetter wrote: > On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote: [...] > > > The last detail is I'm still unclear what GFP flags a blockable > > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > > > I hope I will not make this muddy again ;) > > invalidate_range_start in the blockable mode can use/depend on any sleepable > > allocation allowed in the context it is called from. So in other words > > it is no different from any other function in the kernel that calls into > > the allocator. As the API is missing gfp context then I hope it is not > > called from any restricted contexts (except from the oom which we have > > !blockable for). > > Hm, that's new to me. I thought mmu notifiers very much can be called > from direct reclaim paths, so you have to be extremely careful with > getting back into that one. Correct, I should have added that notifier callbacks ideally do not allocate any memory. They can block, and even that is quite a pain, to be honest.
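To illustrate what a reasonable implementation looks like from the driver side, here is a sketch of a notifier callback that neither allocates nor blocks unconditionally (the my_* names are hypothetical; the pattern is modeled on the userptr-style notifiers):

    static int my_invalidate_range_start(struct mmu_notifier *mn,
                                         const struct mmu_notifier_range *range)
    {
            struct my_notifier *m = container_of(mn, struct my_notifier, mn);

            /* no allocations in here; trylock so !blockable mode cannot hang */
            if (!mmu_notifier_range_blockable(range)) {
                    if (!mutex_trylock(&m->lock))
                            return -EAGAIN;
            } else {
                    mutex_lock(&m->lock);
            }

            /* tear down mappings using preallocated state only */
            my_zap_mappings(m, range->start, range->end);
            mutex_unlock(&m->lock);
            return 0;
    }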
On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote: > On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote: > > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > > > > > So if someone can explain to me how that works with lockdep I can of > > > > > course implement it. But afaics that doesn't exist (I tried to explain > > > > > that somewhere else already), and I'm not really looking forward to > > > > > hacking also on lockdep for this little series. > > > > > > > > Hmm, kind of looks like it is done by calling preempt_disable() > > > > > > Yup. That was v1, then came the suggestion that disabling preemption > > > is maybe not the best thing (the oom reaper could still run for a long > > > time comparatively, if it's cleaning out gigabytes of process memory > > > or what not, hence this dedicated debug infrastructure). > > > > Oh, I'm coming in late, sorry > > > > Anyhow, I was thinking since we agreed this can trigger on some > > CONFIG_DEBUG flag, something like > > > > /* This is a sleepable region, but use preempt_disable to get debugging > > * for calls that are not allowed to block for OOM [.. insert > > * Michal's explanation.. ] */ > > if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range)) > > preempt_disable(); > > ops->invalidate_range_start(); > > I think we also discussed that, and some expressed concerns it would > change behaviour/timing too much for testing. Since this does > disable preemption for real, not just for might_sleep. I don't follow, this is a debug kernel, it will have wildly different timing. Further, the point of this debugging on atomic_sleep is to be as timing-independent as possible since functions with rare sleeps should be guarded by might_sleep() in their common paths. I guess I don't get the push to have some low-overhead debugging for this? Is there something special you are looking for? Jason
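For reference, the complete guarded call would pair up the enable side like so (a sketch extending the snippet quoted above):

    bool debug_nonblock = IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) &&
                          !mmu_notifier_range_blockable(range);

    if (debug_nonblock)
            preempt_disable();      /* arms the might_sleep()/schedule() checks */
    ret = ops->invalidate_range_start(mn, range);
    if (debug_nonblock)
            preempt_enable();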
On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote: > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote: > > > > > > The last detail is I'm still unclear what GFP flags a blockable > > > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > > > > > I hope I will not make this muddy again ;) > > > invalidate_range_start in the blockable mode can use/depend on any sleepable > > > allocation allowed in the context it is called from. > > > > 'in the context it is called from' is the magic phrase, as > > invalidate_range_start is called while holding several different mm > > related locks. I know at least write mmap_sem and i_mmap_rwsem > > (write?) > > > > Can GFP_KERNEL be called while holding those locks? > > i_mmap_rwsem would be problematic because it is taken during > reclaim. Okay.. So the fs_reclaim debugging does catch errors. Do you have any reference for what a false positive looks like? I would like to inject it into the notifier path, as this is very difficult for driver authors to discover and know about, but I'm worried about your false positive remark. I think I understand we can use only GFP_ATOMIC in the notifiers, but we need a strategy to handle OOM to guarantee forward progress. This is just more bugs to fix :( Jason
On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote: > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote: > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote: > > > > > > > > The last detail is I'm still unclear what GFP flags a blockable > > > > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > > > > > > > I hope I will not make this muddy again ;) > > > > invalidate_range_start in the blockable mode can use/depend on any sleepable > > > > allocation allowed in the context it is called from. > > > > > > 'in the context it is called from' is the magic phrase, as > > > invalidate_range_start is called while holding several different mm > > > related locks. I know at least write mmap_sem and i_mmap_rwsem > > > (write?) > > > > > > Can GFP_KERNEL be called while holding those locks? > > > > i_mmap_rwsem would be problematic because it is taken during > > reclaim. > > Okay.. So the fs_reclaim debugging does catch errors. I do not think fs_reclaim is the underlying mechanism to catch this deadlock. It is a simple AA deadlock: you take i_mmap_rwsem, then go down the allocation path into direct reclaim, and take the lock again. Nothing really surprising. fs_reclaim is really there to catch a GFP_NOFS context calling into a less restricted (e.g. GFP_KERNEL) allocation context. > Do you have any > reference for what a false positive looks like? I believe I have given some examples when introducing __GFP_NOLOCKDEP. > I would like to inject it into the notifier path as this is very > difficult for driver authors to discover and know about, but I'm > worried about your false positive remark. > > I think I understand we can use only GFP_ATOMIC in the notifiers, but > we need a strategy to handle OOM to guarantee forward progress. Your example is from the notifier registration IIUC. Can you pre-allocate before taking the locks? Could you point me to some examples where the allocation is necessary in the range notifier callback?
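The pre-allocation pattern for the registration path would be along these lines (a sketch only; this is essentially what commit 35cfa2b0b491 did for the SRCU bookkeeping):

    /* allocate up front, while sleeping in reclaim is still safe */
    struct mmu_notifier_mm *nmm = kmalloc(sizeof(*nmm), GFP_KERNEL);

    if (!nmm)
            return -ENOMEM;

    down_write(&mm->mmap_sem);
    mm_take_all_locks(mm);
    /* only publish preallocated objects here, no allocator calls under the locks */
    mm->mmu_notifier_mm = nmm;
    hlist_add_head_rcu(&mn->hlist, &nmm->list);
    mm_drop_all_locks(mm);
    up_write(&mm->mmap_sem);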
On Fri, Aug 16, 2019 at 2:12 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote: > > On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote: > > > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > > > > > > So if someone can explain to me how that works with lockdep I can of > > > > > > course implement it. But afaics that doesn't exist (I tried to explain > > > > > > that somewhere else already), and I'm not really looking forward to > > > > > > hacking also on lockdep for this little series. > > > > > > > > > > Hmm, kind of looks like it is done by calling preempt_disable() > > > > > > > > Yup. That was v1, then came the suggestion that disabling preemption > > > > is maybe not the best thing (the oom reaper could still run for a long > > > > time comparatively, if it's cleaning out gigabytes of process memory > > > > or what not, hence this dedicated debug infrastructure). > > > > > > Oh, I'm coming in late, sorry > > > > > > Anyhow, I was thinking since we agreed this can trigger on some > > > CONFIG_DEBUG flag, something like > > > > > > /* This is a sleepable region, but use preempt_disable to get debugging > > > * for calls that are not allowed to block for OOM [.. insert > > > * Michal's explanation.. ] */ > > > if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range)) > > > preempt_disable(); > > > ops->invalidate_range_start(); > > > > I think we also discussed that, and some expressed concerns it would > > change behaviour/timing too much for testing. Since this does > > disable preemption for real, not just for might_sleep. > > I don't follow, this is a debug kernel, it will have wildly different > timing. > > Further, the point of this debugging on atomic_sleep is to be as > timing-independent as possible since functions with rare sleeps should > be guarded by might_sleep() in their common paths. > > I guess I don't get the push to have some low-overhead debugging for > this? Is there something special you are looking for? Don't ask me, I'm just trying to get _some_ debugging for this in. I don't care one bit how much overhead it has, because in our CI our debug build has lockdep and everything enabled and it crawls anyway. I started out with the preempt_disable/enable thing like you suggested, and a few rounds later we're here. We can go back to v1 on this one, but I'd prefer to not do the lap too often. Also, aside from this patch (which is prep for the next) and some simple reordering conflicts they're all independent. So if there's no way to paint this bikeshed here (technicolor perhaps?) then I'd like to get at least the others considered. -Daniel
On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote: > Also, aside from this patch (which is prep for the next) and some > simple reordering conflicts they're all independent. So if there's no > way to paint this bikeshed here (technicolor perhaps?) then I'd like > to get at least the others considered. Sure, I think for conflict avoidance reasons I'm probably taking mmu_notifier stuff via hmm.git, so: - Andrew had a minor remark on #1, I am ambivalent and would take it as-is. Your decision if you want to respin. - #2/#3 is this issue, I would stand by the preempt_disable/etc path. Our situation matches yours, debug tests run lockdep/etc. - #4 I like a lot, except the map should enclose range_end too; this can be done after the mm_has_notifiers inside the __mmu_notifier function. Can you respin? I will propose preloading the map in another patch. - #5 is already applied in -rc Jason
On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote: > > Also, aside from this patch (which is prep for the next) and some > > simple reordering conflicts they're all independent. So if there's no > > way to paint this bikeshed here (technicolor perhaps?) then I'd like > > to get at least the others considered. > > Sure, I think for conflict avoidance reasons I'm probably taking > mmu_notifier stuff via hmm.git, so: > > - Andrew had a minor remark on #1, I am ambivalent and would take it > as-is. Your decision if you want to respin. I like mine better, see also the reply from Ralph Campbell. > - #2/#3 is this issue, I would stand by the preempt_disable/etc path > Our situation matches yours, debug tests run lockdep/etc. Since Michal requested the current flavour I think we need to spin a bit more on these here. I guess I'll just rebase them to the end so they're not holding up the others. > - #4 I like a lot, except the map should enclose range_end too, > this can be done after the mm_has_notifiers inside the > __mmu_notifier function To make sure I get this right: The same lockdep context, but also wrapped around invalidate_range_end? From my understanding of pte zapping that makes sense, but I'm definitely not well-versed enough in that area. > Can you respin? Will do. > I will propose preloading the map in another patch > - #5 is already applied in -rc Yup, I'll drop that one. Thanks, Daniel
On Fri, Aug 16, 2019 at 06:36:52PM +0200, Daniel Vetter wrote: > On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote: > > > Also, aside from this patch (which is prep for the next) and some > > > simple reordering conflicts they're all independent. So if there's no > > > way to paint this bikeshed here (technicolor perhaps?) then I'd like > > > to get at least the others considered. > > > > Sure, I think for conflict avoidance reasons I'm probably taking > > mmu_notifier stuff via hmm.git, so: > > > > - Andrew had a minor remark on #1, I am ambivalent and would take it > > as-is. Your decision if you want to respin. > > I like mine better, see also the reply from Ralph Campbell. Sure > > - #2/#3 is this issue, I would stand by the preempt_disable/etc path > > Our situation matches yours, debug tests run lockdep/etc. > > Since Michal requested the current flavour I think we need to spin a bit > more on these here. I guess I'll just rebase them to the end so > they're not holding up the others. > > > - #4 I like a lot, except the map should enclose range_end too, > > this can be done after the mm_has_notifiers inside the > > __mmu_notifier function > > To make sure I get this right: The same lockdep context, but also > wrapped around invalidate_range_end? Yes, the locking context of _range_start and _range_end should be identical; the last time I checked the callers this was the case. So, just add it to __mmu_notifier_invalidate_range_end() outside the SRCU, as there is no reason to burden debug kernel callers twice when mmu notifiers are not enabled. Jason
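ie something like this (a sketch only, reusing the lock map name from Daniel's patch; the elided loop is the existing SRCU iteration):

    void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
                                             bool only_end)
    {
            /*
             * Same lockdep class as _range_start, so both ends of the
             * pair are checked; the mm_has_notifiers() test stays in the
             * inline caller so disabled notifiers pay nothing.
             */
            lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
            /* ... existing SRCU loop calling ops->invalidate_range_end() ... */
            lock_map_release(&__mmu_notifier_invalidate_range_start_map);
    }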
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4fa360a13c1e..915fd9888afb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -217,7 +217,9 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
 * might_sleep - annotation for functions that can sleep
 *
 * this macro will print a stack trace if it is executed in an atomic
- * context (spinlock, irq-handler, ...).
+ * context (spinlock, irq-handler, ...). Additional sections where blocking is
+ * not allowed can be annotated with non_block_start() and non_block_end()
+ * pairs.
 *
 * This is a useful debugging help to be able to catch problems early and not
 * be bitten later when the calling function happens to sleep when it is not
@@ -233,6 +235,10 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define cant_sleep() \
	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()	(current->task_state_change = 0)
+# define non_block_start() \
+	do { current->non_block_count++; } while (0)
+# define non_block_end() \
+	do { WARN_ON(current->non_block_count-- == 0); } while (0)
 #else
  static inline void ___might_sleep(const char *file, int line,
			   int preempt_offset) { }
@@ -241,6 +247,8 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define might_sleep() do { might_resched(); } while (0)
 # define cant_sleep() do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
+# define non_block_start() do { } while (0)
+# define non_block_end() do { } while (0)
 #endif
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..c5630f3dca1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -974,6 +974,10 @@ struct task_struct {
	struct mutex_waiter		*blocked_on;
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	int				non_block_count;
+#endif
+
 #ifdef CONFIG_TRACE_IRQFLAGS
	unsigned int			irq_events;
	unsigned long			hardirq_enable_ip;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..57245770d6cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3700,13 +3700,22 @@ static noinline void __schedule_bug(struct task_struct *prev)
 /*
  * Various schedule()-time debugging checks and statistics:
  */
-static inline void schedule_debug(struct task_struct *prev)
+static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
	if (task_stack_end_corrupted(prev))
		panic("corrupted stack end detected inside scheduler\n");
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	if (!preempt && prev->state && prev->non_block_count) {
+		printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
+			prev->comm, prev->pid, prev->non_block_count);
+		dump_stack();
+		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+	}
+#endif
+
	if (unlikely(in_atomic_preempt_off())) {
		__schedule_bug(prev);
		preempt_count_set(PREEMPT_DISABLED);
@@ -3813,7 +3822,7 @@ static void __sched notrace __schedule(bool preempt)
	rq = cpu_rq(cpu);
	prev = rq->curr;
 
-	schedule_debug(prev);
+	schedule_debug(prev, preempt);
 
	if (sched_feat(HRTICK))
		hrtick_clear(rq);
@@ -6570,7 +6579,7 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
	rcu_sleep_check();
 
	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
-	     !is_idle_task(current)) ||
+	     !is_idle_task(current) && !current->non_block_count) ||
	    system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
	    oops_in_progress)
		return;
@@ -6586,8 +6595,8 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
		"BUG: sleeping function called from invalid context at %s:%d\n",
			file, line);
	printk(KERN_ERR
-		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
-			in_atomic(), irqs_disabled(),
+		"in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
+			in_atomic(), irqs_disabled(), current->non_block_count,
			current->pid, current->comm);
 
	if (task_stack_end_corrupted(current))
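For completeness, a minimal usage sketch of the new annotation (the critical section and helper below are hypothetical, not part of this patch):

    non_block_start();
    /*
     * Anything that reaches schedule() in here now prints the
     * "BUG: scheduling in a non-blocking section" warning on
     * CONFIG_DEBUG_ATOMIC_SLEEP kernels, while preemption stays enabled.
     */
    do_forward_progress_critical_work();    /* hypothetical helper */
    non_block_end();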