Message ID | 20200224182401.353359-2-jean-philippe@linaro.org (mailing list archive)
---|---
State | New, archived
Series | iommu: Shared Virtual Addressing and SMMUv3 support
On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote: > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers: > add a get/put scheme for the registration") provides a convenient way > for users to attach notifier data to an mm. However, it would be even > better to create this notifier data atomically. > > Since the alloc_notifier() callback only takes an mm argument at the > moment, some users have to perform the allocation in two times. > alloc_notifier() initially creates an incomplete structure, which is > then finalized using more context once mmu_notifier_get() returns. This > second step requires carrying an initialization lock in the notifier > data and playing dirty tricks to order memory accesses against live > invalidation. This was the intended pattern. There shouldn't be a real issue as there shouldn't be any data on which to invalidate, ie the later patch does: + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) And that list is empty post-allocation, so no 'dirty tricks' required. The other op callback is release, which also cannot be called as the caller must hold a mmget to establish the notifier. So just use the locking that already exists. There is one function that calls io_mm_get() which immediately calls io_mm_attach, which immediately grabs the global iommu_sva_lock. Thus init the pasid for the first time under that lock and everything is fine. There is nothing inherently wrong with the approach in this patch, but it seems unneeded in this case.. Jason
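To make the suggested ordering concrete, here is a rough sketch of a bind-side helper that finalizes the notifier data under the existing global lock. The io_mm structure, io_mm_ops, io_mm_finalize() and the notifier field name are stand-ins loosely modelled on the SVA series, not code from it:

    /* Sketch only: finish initializing a freshly allocated io_mm under
     * iommu_sva_lock instead of carrying a dedicated init lock in the
     * notifier data.
     */
    static struct io_mm *io_mm_get(struct device *dev, struct mm_struct *mm)
    {
            struct mmu_notifier *mn;
            struct io_mm *io_mm;
            int ret = 0;

            /* May return a new, still-empty io_mm from alloc_notifier() */
            mn = mmu_notifier_get(&io_mm_ops, mm);
            if (IS_ERR(mn))
                    return ERR_CAST(mn);
            io_mm = container_of(mn, struct io_mm, notifier);

            mutex_lock(&iommu_sva_lock);
            if (!io_mm->pasid)
                    ret = io_mm_finalize(io_mm, dev); /* assign PASID, ops, ctx */
            mutex_unlock(&iommu_sva_lock);

            if (ret) {
                    mmu_notifier_put(mn);
                    return ERR_PTR(ret);
            }
            return io_mm;
    }

Since io_mm_attach() takes iommu_sva_lock anyway, finalizing under the same lock adds no extra locking to the bind path.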
On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote: > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote: > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers: > > add a get/put scheme for the registration") provides a convenient way > > for users to attach notifier data to an mm. However, it would be even > > better to create this notifier data atomically. > > > > Since the alloc_notifier() callback only takes an mm argument at the > > moment, some users have to perform the allocation in two times. > > alloc_notifier() initially creates an incomplete structure, which is > > then finalized using more context once mmu_notifier_get() returns. This > > second step requires carrying an initialization lock in the notifier > > data and playing dirty tricks to order memory accesses against live > > invalidation. > > This was the intended pattern. Tthere shouldn't be an real issue as > there shouldn't be any data on which to invalidate, ie the later patch > does: > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) > > And that list is empty post-allocation, so no 'dirty tricks' required. Before introducing this patch I had the following code: + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) { + /* + * To ensure that we observe the initialization of io_mm fields + * by io_mm_finalize() before the registration of this bond to + * the list by io_mm_attach(), introduce an address dependency + * between bond and io_mm. It pairs with the smp_store_release() + * from list_add_rcu(). + */ + io_mm = rcu_dereference(bond->io_mm); + io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx, + start, end - start); + } (1) io_mm_get() would obtain an empty io_mm from iommu_notifier_get(). (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc. (3) finally io_mm_attach() would add the bond to io_mm->devices. Since the above code can run before (2) it needs to observe valid io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond initialized by (3). Which I believe requires the address dependency from the rcu_dereference() above or some stronger barrier to pair with the list_add_rcu(). If io_mm->ctx and io_mm->ops are already valid before the mmu notifier is published, then we don't need that stuff. That's the main reason I would have liked moving everything to alloc_notifier(), the locking below isn't a big deal. > The other op callback is release, which also cannot be called as the > caller must hold a mmget to establish the notifier. > > So just use the locking that already exists. There is one function > that calls io_mm_get() which immediately calls io_mm_attach, which > immediately grabs the global iommu_sva_lock. > > Thus init the pasid for the first time under that lock and everything > is fine. I agree with this, can't remember why I used a separate lock for initialization rather than reusing iommu_sva_lock. Thanks, Jean > > There is nothing inherently wrong with the approach in this patch, but > it seems unneeded in this case.. > > Jason
On Tue, Feb 25, 2020 at 10:24:39AM +0100, Jean-Philippe Brucker wrote: > On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote: > > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote: > > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers: > > > add a get/put scheme for the registration") provides a convenient way > > > for users to attach notifier data to an mm. However, it would be even > > > better to create this notifier data atomically. > > > > > > Since the alloc_notifier() callback only takes an mm argument at the > > > moment, some users have to perform the allocation in two times. > > > alloc_notifier() initially creates an incomplete structure, which is > > > then finalized using more context once mmu_notifier_get() returns. This > > > second step requires carrying an initialization lock in the notifier > > > data and playing dirty tricks to order memory accesses against live > > > invalidation. > > > > This was the intended pattern. Tthere shouldn't be an real issue as > > there shouldn't be any data on which to invalidate, ie the later patch > > does: > > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) > > > > And that list is empty post-allocation, so no 'dirty tricks' required. > > Before introducing this patch I had the following code: > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) { > + /* > + * To ensure that we observe the initialization of io_mm fields > + * by io_mm_finalize() before the registration of this bond to > + * the list by io_mm_attach(), introduce an address dependency > + * between bond and io_mm. It pairs with the smp_store_release() > + * from list_add_rcu(). > + */ > + io_mm = rcu_dereference(bond->io_mm); A rcu_dereference isn't needed here, just a normal dereference is fine. > + io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx, > + start, end - start); > + } > > (1) io_mm_get() would obtain an empty io_mm from iommu_notifier_get(). > (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc. > (3) finally io_mm_attach() would add the bond to io_mm->devices. > > Since the above code can run before (2) it needs to observe valid > io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond > initialized by (3). Which I believe requires the address dependency from > the rcu_dereference() above or some stronger barrier to pair with the > list_add_rcu(). The list_for_each_entry_rcu() is an acquire that already pairs with the release in list_add_rcu(); all you need is a data dependency chain starting on bond to be correct on ordering. But this is super tricky :\ > If io_mm->ctx and io_mm->ops are already valid before the > mmu notifier is published, then we don't need that stuff. So, this trickiness with RCU is not a bad reason to introduce the priv scheme, maybe explain it in the commit message? Jason
On Tue, Feb 25, 2020 at 10:08:14AM -0400, Jason Gunthorpe wrote: > On Tue, Feb 25, 2020 at 10:24:39AM +0100, Jean-Philippe Brucker wrote: > > On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote: > > > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote: > > > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers: > > > > add a get/put scheme for the registration") provides a convenient way > > > > for users to attach notifier data to an mm. However, it would be even > > > > better to create this notifier data atomically. > > > > > > > > Since the alloc_notifier() callback only takes an mm argument at the > > > > moment, some users have to perform the allocation in two times. > > > > alloc_notifier() initially creates an incomplete structure, which is > > > > then finalized using more context once mmu_notifier_get() returns. This > > > > second step requires carrying an initialization lock in the notifier > > > > data and playing dirty tricks to order memory accesses against live > > > > invalidation. > > > > > > This was the intended pattern. Tthere shouldn't be an real issue as > > > there shouldn't be any data on which to invalidate, ie the later patch > > > does: > > > > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) > > > > > > And that list is empty post-allocation, so no 'dirty tricks' required. > > > > Before introducing this patch I had the following code: > > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) { > > + /* > > + * To ensure that we observe the initialization of io_mm fields > > + * by io_mm_finalize() before the registration of this bond to > > + * the list by io_mm_attach(), introduce an address dependency > > + * between bond and io_mm. It pairs with the smp_store_release() > > + * from list_add_rcu(). > > + */ > > + io_mm = rcu_dereference(bond->io_mm); > > A rcu_dereference isn't need here, just a normal derference is fine. bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(), which does bond->io_mm under rcu_read_lock()) > > > + io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx, > > + start, end - start); > > + } > > > > (1) io_mm_get() would obtain an empty io_mm from iommu_notifier_get(). > > (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc. > > (3) finally io_mm_attach() would add the bond to io_mm->devices. > > > > Since the above code can run before (2) it needs to observe valid > > io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond > > initialized by (3). Which I believe requires the address dependency from > > the rcu_dereference() above or some stronger barrier to pair with the > > list_add_rcu(). > > The list_for_each_entry_rcu() is an acquire that already pairs with > the release in list_add_rcu(), all you need is a data dependency chain > starting on bond to be correct on ordering. > > But this is super tricky :\ > > > If io_mm->ctx and io_mm->ops are already valid before the > > mmu notifier is published, then we don't need that stuff. > > So, this trickyness with RCU is not a bad reason to introduce the priv > scheme, maybe explain it in the commit message? Ok, I've added this to the commit message: The IOMMU SVA module, which attaches an mm to multiple devices, exemplifies this situation. 
In essence it does:

    mmu_notifier_get()
      alloc_notifier()
         A = kzalloc()
      /* MMU notifier is published */
    A->ctx = ctx;                       // (1)
    device->A = A;
    list_add_rcu(device, A->devices);   // (2)

The invalidate notifier, which may start running before A is fully initialized at (1), does the following:

    io_mm_invalidate(A)
      list_for_each_entry_rcu(device, A->devices)
        A = device->A;                  // (3)
        device->invalidate(A->ctx)

To ensure that an invalidate() thread observes write (1) before (2), it needs the address dependency (3). The resulting code is subtle and difficult to understand. If instead we fully initialize object A before publishing the MMU notifier, we don't need the complexity added by (3).

I'll try to improve the wording before sending next version.

Thanks,
Jean
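For contrast, a sketch in the same pseudocode style of the flow this patch enables, where the context reaches alloc_notifier() through the new argument and (1) happens before the notifier is published (names follow the draft above, not final code):

    mmu_notifier_get(ops, mm, ctx)
      alloc_notifier(mm, ctx)
         A = kzalloc()
         A->ctx = ctx;                  // (1) done before publication
      /* MMU notifier is published */
    device->A = A;
    list_add_rcu(device, A->devices);   // (2)

An invalidate thread that finds the device on A->devices then necessarily observes a fully initialized A, and the address dependency (3) is no longer needed.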
On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote: > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) { > > > + /* > > > + * To ensure that we observe the initialization of io_mm fields > > > + * by io_mm_finalize() before the registration of this bond to > > > + * the list by io_mm_attach(), introduce an address dependency > > > + * between bond and io_mm. It pairs with the smp_store_release() > > > + * from list_add_rcu(). > > > + */ > > > + io_mm = rcu_dereference(bond->io_mm); > > > > A rcu_dereference isn't need here, just a normal derference is fine. > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(), > which does bond->io_mm under rcu_read_lock()) I'm surprised the bond->io_mm can change over the lifetime of the bond memory.. > > > If io_mm->ctx and io_mm->ops are already valid before the > > > mmu notifier is published, then we don't need that stuff. > > > > So, this trickyness with RCU is not a bad reason to introduce the priv > > scheme, maybe explain it in the commit message? > > Ok, I've added this to the commit message: > > The IOMMU SVA module, which attaches an mm to multiple devices, > exemplifies this situation. In essence it does: > > mmu_notifier_get() > alloc_notifier() > A = kzalloc() > /* MMU notifier is published */ > A->ctx = ctx; // (1) > device->A = A; > list_add_rcu(device, A->devices); // (2) > > The invalidate notifier, which may start running before A is fully > initialized at (1), does the following: > > io_mm_invalidate(A) > list_for_each_entry_rcu(device, A->devices) > A = device->A; // (3) I would drop the workaround from the description; it is enough to say that the line below needs to observe (1) after (2) and this is trivially achieved by moving (1) to before publishing the notifier so the core MM locking can be used. Regards, Jason
On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote: > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote: > > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) { > > > > + /* > > > > + * To ensure that we observe the initialization of io_mm fields > > > > + * by io_mm_finalize() before the registration of this bond to > > > > + * the list by io_mm_attach(), introduce an address dependency > > > > + * between bond and io_mm. It pairs with the smp_store_release() > > > > + * from list_add_rcu(). > > > > + */ > > > > + io_mm = rcu_dereference(bond->io_mm); > > > > > > A rcu_dereference isn't need here, just a normal derference is fine. > > > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(), > > which does bond->io_mm under rcu_read_lock()) > > I'm surprised the bond->io_mm can change over the lifetime of the > bond memory.. The normal lifetime of the bond is between device driver calls to bind() and unbind(). If the mm exits early, though, we clear bond->io_mm. The bond is then stale but can only be freed when the device driver releases it with unbind(). > > > > > If io_mm->ctx and io_mm->ops are already valid before the > > > > mmu notifier is published, then we don't need that stuff. > > > > > > So, this trickyness with RCU is not a bad reason to introduce the priv > > > scheme, maybe explain it in the commit message? > > > > Ok, I've added this to the commit message: > > > > The IOMMU SVA module, which attaches an mm to multiple devices, > > exemplifies this situation. In essence it does: > > > > mmu_notifier_get() > > alloc_notifier() > > A = kzalloc() > > /* MMU notifier is published */ > > A->ctx = ctx; // (1) > > device->A = A; > > list_add_rcu(device, A->devices); // (2) > > > > The invalidate notifier, which may start running before A is fully > > initialized at (1), does the following: > > > > io_mm_invalidate(A) > > list_for_each_entry_rcu(device, A->devices) > > A = device->A; // (3) > > I would drop the work around from the decription, it is enough to say > that the line below needs to observe (1) after (2) and this is > trivially achieved by moving (1) to before publishing the notifier so > the core MM locking can be used. Ok, will do Thanks, Jean
On Fri, Feb 28, 2020 at 04:04:27PM +0100, Jean-Philippe Brucker wrote: > On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote: > > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote: > > > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) { > > > > > + /* > > > > > + * To ensure that we observe the initialization of io_mm fields > > > > > + * by io_mm_finalize() before the registration of this bond to > > > > > + * the list by io_mm_attach(), introduce an address dependency > > > > > + * between bond and io_mm. It pairs with the smp_store_release() > > > > > + * from list_add_rcu(). > > > > > + */ > > > > > + io_mm = rcu_dereference(bond->io_mm); > > > > > > > > A rcu_dereference isn't need here, just a normal derference is fine. > > > > > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(), > > > which does bond->io_mm under rcu_read_lock()) > > > > I'm surprised the bond->io_mm can change over the lifetime of the > > bond memory.. > > The normal lifetime of the bond is between device driver calls to bind() > and unbind(). If the mm exits early, though, we clear bond->io_mm. The > bond is then stale but can only be freed when the device driver releases > it with unbind(). I usually advocate for simple use of these APIs. The mmu_notifier_get() should happen in bind() and the matching put should happen in the call_rcu callback that does the kfree. Then you can never get a stale pointer. Don't worry about exit_mmap(). release() is an unusual callback and I see a lot of places using it wrong. The purpose of release is to invalidate_all, that is it. Also, confusingly release may be called multiple times in some situations, so it shouldn't disturb anything that might impact a 2nd call. Jason
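A short sketch of the lifetime rule described here, assuming hypothetical iommu_bond/io_mm structures with the bond freed through RCU; this illustrates the suggestion, it is not code from the series:

    /* Sketch: bond->io_mm stays valid for the bond's whole lifetime. The
     * reference taken by mmu_notifier_get() in bind() is only dropped from
     * the RCU callback that frees the bond itself.
     */
    static void iommu_bond_free_rcu(struct rcu_head *rcu)
    {
            struct iommu_bond *bond = container_of(rcu, struct iommu_bond, rcu);

            mmu_notifier_put(&bond->io_mm->notifier);       /* matches bind() */
            kfree(bond);
    }

    static void iommu_sva_unbind(struct iommu_bond *bond)
    {
            /* remove the bond from io_mm->devices, quiesce DMA, then: */
            call_rcu(&bond->rcu, iommu_bond_free_rcu);
    }

With that arrangement a reader that found the bond under rcu_read_lock() can always dereference bond->io_mm, since the pointer is never cleared before the bond memory goes away.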
On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote: > -static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm) > +static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm, void *privdata) Please don't introduce any > 80 char lines. Not here, and not anywhere else in this patch or the series.
On Fri, Feb 28, 2020 at 11:13:40AM -0400, Jason Gunthorpe wrote: > On Fri, Feb 28, 2020 at 04:04:27PM +0100, Jean-Philippe Brucker wrote: > > On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote: > > > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote: > > > > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) { > > > > > > + /* > > > > > > + * To ensure that we observe the initialization of io_mm fields > > > > > > + * by io_mm_finalize() before the registration of this bond to > > > > > > + * the list by io_mm_attach(), introduce an address dependency > > > > > > + * between bond and io_mm. It pairs with the smp_store_release() > > > > > > + * from list_add_rcu(). > > > > > > + */ > > > > > > + io_mm = rcu_dereference(bond->io_mm); > > > > > > > > > > A rcu_dereference isn't need here, just a normal derference is fine. > > > > > > > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(), > > > > which does bond->io_mm under rcu_read_lock()) > > > > > > I'm surprised the bond->io_mm can change over the lifetime of the > > > bond memory.. > > > > The normal lifetime of the bond is between device driver calls to bind() > > and unbind(). If the mm exits early, though, we clear bond->io_mm. The > > bond is then stale but can only be freed when the device driver releases > > it with unbind(). > > I usually advocate for simple use of these APIs. The mm_notifier_get() > should happen in bind() and the matching put should happen in the > call_rcu callbcak that does the kfree. I tried to keep it simple like that: normally mmu_notifier_get() is called in bind(), and mmu_notifier_put() is called in unbind(). Multiple device drivers may call bind() with the same mm. Each bind() calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond (a device<->mm link). Each bond is freed by calling unbind(), which calls mmu_notifier_put(). That's the most common case. Now if the process is killed and the mm disappears, we do need to avoid use-after-free caused by DMA of the mappings and the page tables. So the release() callback, before doing invalidate_all, stops DMA and clears the page table pointer on the IOMMU side. It detaches all bonds from the io_mm, calling mmu_notifier_put() for each of them. After release(), bond objects still exists and device drivers still need to free them with unbind(), but they don't point to an io_mm anymore. > Then you can never get a stale > pointer. Don't worry about exit_mmap(). > > release() is an unusual callback and I see alot of places using it > wrong. The purpose of release is to invalidate_all, that is it. > > Also, confusingly release may be called multiple times in some > situations, so it shouldn't disturb anything that might impact a 2nd > call. I hadn't realized that. The current implementation should be safe against it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you have an example of such a situation? I'm trying to write tests for this kind of corner cases. Thanks, Jean
On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > I tried to keep it simple like that: normally mmu_notifier_get() is called > in bind(), and mmu_notifier_put() is called in unbind(). > > Multiple device drivers may call bind() with the same mm. Each bind() > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > (a device<->mm link). Each bond is freed by calling unbind(), which calls > mmu_notifier_put(). > > That's the most common case. Now if the process is killed and the mm > disappears, we do need to avoid use-after-free caused by DMA of the > mappings and the page tables. This is why release must do invalidate all - but it doesn't need to do any more - as no SPTE can be established without a mmget() - and mmget() is no longer possible past release. > So the release() callback, before doing invalidate_all, stops DMA > and clears the page table pointer on the IOMMU side. It detaches all > bonds from the io_mm, calling mmu_notifier_put() for each of > them. After release(), bond objects still exists and device drivers > still need to free them with unbind(), but they don't point to an > io_mm anymore. Why is so much work needed in release? It really should just be invalidate all, usually trying to sort out all the locking for the more complicated stuff is not worthwhile. If other stuff is implicitly relying on the mm being alive and release to fence against that then it is already racy. If it doesn't, then why bother doing complicated work in release? > > Then you can never get a stale > > pointer. Don't worry about exit_mmap(). > > > > release() is an unusual callback and I see alot of places using it > > wrong. The purpose of release is to invalidate_all, that is it. > > > > Also, confusingly release may be called multiple times in some > > situations, so it shouldn't disturb anything that might impact a 2nd > > call. > > I hadn't realized that. The current implementation should be safe against > it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you > have an example of such a situation? I'm trying to write tests for this > kind of corner cases. Hmm, let me think. Ah, you have to be using mmu_notifier_unregister() to get that race. This is one of the things that get/put don't suffer from - but they conversely don't guarantee that release() will be called, so it is up to the caller to ensure everything is fenced before calling put. Jason
On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > mmu_notifier_put(). > > > > That's the most common case. Now if the process is killed and the mm > > disappears, we do need to avoid use-after-free caused by DMA of the > > mappings and the page tables. > > This is why release must do invalidate all - but it doesn't need to do > any more - as no SPTE can be established without a mmget() - and > mmget() is no longer possible past release. In our case we don't have SPTEs, the whole pgd is shared between MMU and IOMMU (isolated using PASID tables). Taking the concrete example of the crypto accelerator: 1. A process opens a queue in the accelerator. That queue is bound to the address space: a PASID is allocated for the mm, and mm->pgd is written into the IOMMU PASID table. 2. The process queues some work and waits. In the background, the accelerators performs DMA on the process address space, by using the mm's PASID. 3. Now the process gets killed, and release() is called. At this point no one told the device to stop working on this queue, it may still be doing DMA on this address space. So the first thing we do is notify the device driver that the bond is going away, and that it must stop the queue and flush remaining DMA transactions for this PASID. Then we also clear the pgd from the IOMMU PASID table. If we only did invalidate-all and somehow the queue wasn't properly stopped, concurrent DMA would immediately form new IOTLB entries since the page tables haven't been wiped at this point. And later, it would use-after-free page tables and mappings. Whereas with a clear pgd it would just generate IOMMU fault events, which are undesirable but harmless. Thanks, Jean > > So the release() callback, before doing invalidate_all, stops DMA > > and clears the page table pointer on the IOMMU side. It detaches all > > bonds from the io_mm, calling mmu_notifier_put() for each of > > them. After release(), bond objects still exists and device drivers > > still need to free them with unbind(), but they don't point to an > > io_mm anymore. > > Why is so much work needed in release? It really should just be > invalidate all, usually trying to sort out all the locking for the > more complicated stuff is not worthwhile. > > If other stuff is implicitly relying on the mm being alive and release > to fence against that then it is already racy. If it doesn't, then why > bother doing complicated work in release? > > > > Then you can never get a stale > > > pointer. Don't worry about exit_mmap(). > > > > > > release() is an unusual callback and I see alot of places using it > > > wrong. The purpose of release is to invalidate_all, that is it. > > > > > > Also, confusingly release may be called multiple times in some > > > situations, so it shouldn't disturb anything that might impact a 2nd > > > call. > > > > I hadn't realized that. The current implementation should be safe against > > it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you > > have an example of such a situation? I'm trying to write tests for this > > kind of corner cases. 
> > Hmm, let me think. Ah, you have to be using mmu_notifier_unregister() > to get that race. This is one of the things that get/put don't suffer > from - but they conversely don't guarantee that release() will be > called, so it is up to the caller to ensure everything is fenced > before calling put. > > Jason
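A rough outline of the release() flow described in this message. The mm_exit() callback and the PASID-table helper are hypothetical names standing in for the driver notification and the IOMMU-driver operation, and the locking is simplified:

    /* Sketch only: on mm exit, stop DMA and detach every bond before the
     * page tables disappear.
     */
    static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
    {
            struct io_mm *io_mm = container_of(mn, struct io_mm, notifier);
            struct iommu_bond *bond;

            mutex_lock(&iommu_sva_lock);
            list_for_each_entry(bond, &io_mm->devices, mm_head) {
                    /* Tell the device driver to stop the queue and flush DMA */
                    if (bond->mm_exit)
                            bond->mm_exit(bond->dev, bond->drvdata);
                    /* Clear the pgd from this device's PASID table entry */
                    iommu_clear_pasid_entry(bond->dev, io_mm->pasid); /* hypothetical */
            }
            mutex_unlock(&iommu_sva_lock);
    }

The series additionally detaches the bonds and drops their io_mm references here, which is the part of the work being questioned in the following replies.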
On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote: > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > > mmu_notifier_put(). > > > > > > That's the most common case. Now if the process is killed and the mm > > > disappears, we do need to avoid use-after-free caused by DMA of the > > > mappings and the page tables. > > > > This is why release must do invalidate all - but it doesn't need to do > > any more - as no SPTE can be established without a mmget() - and > > mmget() is no longer possible past release. > > In our case we don't have SPTEs, the whole pgd is shared between MMU and > IOMMU (isolated using PASID tables). Okay, but this just means that 'invalidate all' also requires switching the PASID to use some pgd that is permanently 'all fail'. > At this point no one told the device to stop working on this queue, > it may still be doing DMA on this address space. Sure, but there are lots of cases where a defective user space can cause pages under active DMA to disappear, like munmap for instance. Process exit is really no different, the PASID should take errors and the device & driver should do whatever error flow it has. Involving a complex driver flow in the exit_mmap path seems like dangerous complexity to me. Jason
On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote: > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote: > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > > > mmu_notifier_put(). > > > > > > > > That's the most common case. Now if the process is killed and the mm > > > > disappears, we do need to avoid use-after-free caused by DMA of the > > > > mappings and the page tables. > > > > > > This is why release must do invalidate all - but it doesn't need to do > > > any more - as no SPTE can be established without a mmget() - and > > > mmget() is no longer possible past release. > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and > > IOMMU (isolated using PASID tables). > > Okay, but this just means that 'invalidate all' also requires > switching the PASID to use some pgd that is permanently 'all fail'. > > > At this point no one told the device to stop working on this queue, > > it may still be doing DMA on this address space. > > Sure, but there are lots of cases where a defective user space can > cause pages under active DMA to disappear, like munmap for > instance. Process exit is really no different, the PASID should take > errors and the device & driver should do whatever error flow it has. We do have the possibility to shut things down in order, so to me this feels like a band-aid. The idea has come up before though [1], and I'm not strongly opposed to this model, but I'm still not convinced it's necessary. It does add more complexity to IOMMU drivers, to avoid printing out the errors that we wouldn't otherwise see, whereas device drivers need in any case to implement the logic that forces stop DMA. Thanks, Jean [1] https://lore.kernel.org/linux-iommu/4d68da96-0ad5-b412-5987-2f7a6aa796c3@amd.com/ > > Involving a complex driver flow in the exit_mmap path seems like > dangerous complexity to me. > > Jason
On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote: > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote: > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote: > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > > > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > > > > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > > > > mmu_notifier_put(). > > > > > > > > > > That's the most common case. Now if the process is killed and the mm > > > > > disappears, we do need to avoid use-after-free caused by DMA of the > > > > > mappings and the page tables. > > > > > > > > This is why release must do invalidate all - but it doesn't need to do > > > > any more - as no SPTE can be established without a mmget() - and > > > > mmget() is no longer possible past release. > > > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and > > > IOMMU (isolated using PASID tables). > > > > Okay, but this just means that 'invalidate all' also requires > > switching the PASID to use some pgd that is permanently 'all fail'. > > > > > At this point no one told the device to stop working on this queue, > > > it may still be doing DMA on this address space. > > > > Sure, but there are lots of cases where a defective user space can > > cause pages under active DMA to disappear, like munmap for > > instance. Process exit is really no different, the PASID should take > > errors and the device & driver should do whatever error flow it has. > > We do have the possibility to shut things down in order, so to me this > feels like a band-aid. ->release() is called by exit_mmap which is called by mmput. There are over a 100 callsites to mmput() and I'm not totally sure what the rules are for release(). We've run into problems before with things like this. IMHO, due to this, it is best for release to be simple and have conservative requirements on context like all the other notifier callbacks. It is is not a good place to put complex HW fencing driver code. In particular that link you referenced is suggesting the driver tear down could take minutes - IMHO it is not OK to block mmput() for minutes. > The idea has come up before though [1], and I'm not strongly opposed > to this model, but I'm still not convinced it's necessary. It does > add more complexity to IOMMU drivers, to avoid printing out the > errors that we wouldn't otherwise see, whereas device drivers need > in any case to implement the logic that forces stop DMA. Errors should not be printed to the kernel log for PASID cases anyhow. PASID will be used by unpriv user, and unpriv user should not be able to trigger kernel prints at will, eg by doing dma to nmap VA or whatever. Process exit is just another case of this, and should not be treated specially. Jason
On Fri, Mar 06, 2020 at 01:42:39PM -0400, Jason Gunthorpe wrote: > On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote: > > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote: > > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote: > > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > > > > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > > > > > > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > > > > > mmu_notifier_put(). > > > > > > > > > > > > That's the most common case. Now if the process is killed and the mm > > > > > > disappears, we do need to avoid use-after-free caused by DMA of the > > > > > > mappings and the page tables. > > > > > > > > > > This is why release must do invalidate all - but it doesn't need to do > > > > > any more - as no SPTE can be established without a mmget() - and > > > > > mmget() is no longer possible past release. > > > > > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and > > > > IOMMU (isolated using PASID tables). > > > > > > Okay, but this just means that 'invalidate all' also requires > > > switching the PASID to use some pgd that is permanently 'all fail'. > > > > > > > At this point no one told the device to stop working on this queue, > > > > it may still be doing DMA on this address space. > > > > > > Sure, but there are lots of cases where a defective user space can > > > cause pages under active DMA to disappear, like munmap for > > > instance. Process exit is really no different, the PASID should take > > > errors and the device & driver should do whatever error flow it has. > > > > We do have the possibility to shut things down in order, so to me this > > feels like a band-aid. > > ->release() is called by exit_mmap which is called by mmput. There are > over a 100 callsites to mmput() and I'm not totally sure what the > rules are for release(). We've run into problems before with things > like this. A concrete example of something that could go badly if mmput() takes too long would greatly help. Otherwise I'll have a hard time justifying the added complexity. I wrote a prototype that removes the device driver callback from release(). It works with SMMUv3, but complicates the PASID descriptor code, which is already awful with my recent changes and this series. > IMHO, due to this, it is best for release to be simple and have > conservative requirements on context like all the other notifier > callbacks. It is is not a good place to put complex HW fencing driver > code. > > In particular that link you referenced is suggesting the driver tear > down could take minutes - IMHO it is not OK to block mmput() for > minutes. > > > The idea has come up before though [1], and I'm not strongly opposed > > to this model, but I'm still not convinced it's necessary. It does > > add more complexity to IOMMU drivers, to avoid printing out the > > errors that we wouldn't otherwise see, whereas device drivers need > > in any case to implement the logic that forces stop DMA. > > Errors should not be printed to the kernel log for PASID cases > anyhow. 
> PASID will be used by unpriv user, and unpriv user should not > be able to trigger kernel prints at will, eg by doing dma to nmap VA > or whatever. I agree. There is a difference, though, between invalid mappings and the absence of a pgd. The former comes from userspace issuing DMA on unmapped buffers, while the latter is typically a device/driver error which normally needs to be reported. On Arm SMMUv3 they are handled differently by the hardware. But instead of disabling the whole PASID context on mm exit, we can quietly abort incoming transactions while waiting for unbind(). And I think the other IOMMUs treat invalid PASID descriptor the same as invalid translation table descriptor. At least VT-d quietly returns a no-translation response to ATS TR and rejects PRI PR. I haven't found the equivalent in the AMD IOMMU spec yet. Thanks, Jean
On Fri, Mar 13, 2020 at 07:49:29PM +0100, Jean-Philippe Brucker wrote: > On Fri, Mar 06, 2020 at 01:42:39PM -0400, Jason Gunthorpe wrote: > > On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote: > > > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote: > > > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote: > > > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > > > > > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > > > > > > > > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > > > > > > mmu_notifier_put(). > > > > > > > > > > > > > > That's the most common case. Now if the process is killed and the mm > > > > > > > disappears, we do need to avoid use-after-free caused by DMA of the > > > > > > > mappings and the page tables. > > > > > > > > > > > > This is why release must do invalidate all - but it doesn't need to do > > > > > > any more - as no SPTE can be established without a mmget() - and > > > > > > mmget() is no longer possible past release. > > > > > > > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and > > > > > IOMMU (isolated using PASID tables). > > > > > > > > Okay, but this just means that 'invalidate all' also requires > > > > switching the PASID to use some pgd that is permanently 'all fail'. > > > > > > > > > At this point no one told the device to stop working on this queue, > > > > > it may still be doing DMA on this address space. > > > > > > > > Sure, but there are lots of cases where a defective user space can > > > > cause pages under active DMA to disappear, like munmap for > > > > instance. Process exit is really no different, the PASID should take > > > > errors and the device & driver should do whatever error flow it has. > > > > > > We do have the possibility to shut things down in order, so to me this > > > feels like a band-aid. > > > > ->release() is called by exit_mmap which is called by mmput. There are > > over a 100 callsites to mmput() and I'm not totally sure what the > > rules are for release(). We've run into problems before with things > > like this. > > A concrete example of something that could go badly if mmput() takes too > long would greatly help. Otherwise I'll have a hard time justifying the > added complexity. It is not just takes too long, but also accidently causing locking problems by doing very complex code in the release callback. Unless you audit all the mmput call sites to define the calling conditions I can't even say what the risk is here. Particularly, calling something with impossible to audit locking like the dma_fence stuff from release is probably impossible to prove safety and then keep safe. It is easy enough to see where takes too long can have a bad impact, mmput is called all over the place. Just in the RDMA code slowing it down would block ODP page faulting completely for all processes. This is not acceptable. For this reason release callbacks must be simple/fast and must have trivial locking. > > Errors should not be printed to the kernel log for PASID cases > > anyhow. 
> > PASID will be used by unpriv user, and unpriv user should not > > be able to trigger kernel prints at will, eg by doing dma to nmap VA > > or whatever. > > I agree. There is a difference, though, between invalid mappings and the > absence of a pgd. The former comes from userspace issuing DMA on unmapped > buffers, while the latter is typically a device/driver error which > normally needs to be reported. Why not make the pgd present as I suggested? Point it at a static dummy pgd that always fails to page fault during release? Make the pgd not present only once the PASID is fully destroyed. That really is the only thing release is supposed to mean -> unmap all VAs. Jason
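To illustrate the alternative being suggested, a sketch where release() only swaps the page table pointer for a statically allocated, always-faulting dummy and leaves the heavier teardown to unbind(); the helper name and the empty_pgd table are assumptions, not existing API:

    static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
    {
            struct io_mm *io_mm = container_of(mn, struct io_mm, notifier);
            struct iommu_bond *bond;

            /* Keep the PASID entry present but point it at a dummy pgd whose
             * translations always fault, so late DMA is quietly aborted
             * instead of walking freed page tables.
             */
            rcu_read_lock();
            list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
                    iommu_set_pasid_pgd(bond->dev, io_mm->pasid, empty_pgd); /* hypothetical */
            rcu_read_unlock();

            /* Stopping the queues and freeing the PASID wait for unbind() */
    }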
On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > This is why release must do invalidate all - but it doesn't need to do > any more - as no SPTE can be established without a mmget() - and > mmget() is no longer possible past release. Maybe we should rename the release method to invalidate_all?
On Mon, Mar 16, 2020 at 08:46:59AM -0700, Christoph Hellwig wrote: > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > This is why release must do invalidate all - but it doesn't need to do > > any more - as no SPTE can be established without a mmget() - and > > mmget() is no longer possible past release. > > Maybe we should rename the release method to invalidate_all? It is a better name. The function must also fence future access if the mirror is not using mmget(), and stop using the pgd/etc pointer if the page tables are accessed directly. Jason
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 10921cd2608d..77610e1704f6 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -235,7 +235,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 		gms, range->start, range->end);
 }
 
-static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm)
+static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm, void *privdata)
 {
 	struct gru_mm_struct *gms;
 
@@ -266,7 +266,7 @@ struct gru_mm_struct *gru_register_mmu_notifier(void)
 {
 	struct mmu_notifier *mn;
 
-	mn = mmu_notifier_get_locked(&gru_mmuops, current->mm);
+	mn = mmu_notifier_get_locked(&gru_mmuops, current->mm, NULL);
 	if (IS_ERR(mn))
 		return ERR_CAST(mn);
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 736f6918335e..06e68fa2b019 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -207,7 +207,7 @@ struct mmu_notifier_ops {
 	 * callbacks are currently running. It is called from a SRCU callback
 	 * and cannot sleep.
 	 */
-	struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm);
+	struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm, void *privdata);
 	void (*free_notifier)(struct mmu_notifier *subscription);
 };
 
@@ -271,14 +271,16 @@ static inline int mm_has_notifiers(struct mm_struct *mm)
 }
 
 struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
-					     struct mm_struct *mm);
+					     struct mm_struct *mm,
+					     void *privdata);
 static inline struct mmu_notifier *
-mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm)
+mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm,
+		 void *privdata)
 {
 	struct mmu_notifier *ret;
 
 	down_write(&mm->mmap_sem);
-	ret = mmu_notifier_get_locked(ops, mm);
+	ret = mmu_notifier_get_locked(ops, mm, privdata);
 	up_write(&mm->mmap_sem);
 	return ret;
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index ef3973a5d34a..8beb9dcbe0fd 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -734,6 +734,7 @@ find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
  *                           the mm & ops
  * @ops: The operations struct being subscribe with
  * @mm : The mm to attach notifiers too
+ * @privdata: Initialization data passed down to ops->alloc_notifier()
  *
  * This function either allocates a new mmu_notifier via
  * ops->alloc_notifier(), or returns an already existing notifier on the
@@ -747,7 +748,8 @@ find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
  * and can be converted to an active mm pointer via mmget_not_zero().
  */
 struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
-					     struct mm_struct *mm)
+					     struct mm_struct *mm,
+					     void *privdata)
 {
 	struct mmu_notifier *subscription;
 	int ret;
@@ -760,7 +762,7 @@ struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
 		return subscription;
 	}
 
-	subscription = ops->alloc_notifier(mm);
+	subscription = ops->alloc_notifier(mm, privdata);
 	if (IS_ERR(subscription))
 		return subscription;
 	subscription->ops = ops;
The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers: add a get/put scheme for the registration") provides a convenient way for users to attach notifier data to an mm. However, it would be even better to create this notifier data atomically.

Since the alloc_notifier() callback only takes an mm argument at the moment, some users have to perform the allocation in two steps. alloc_notifier() initially creates an incomplete structure, which is then finalized using more context once mmu_notifier_get() returns. This second step requires carrying an initialization lock in the notifier data and playing dirty tricks to order memory accesses against live invalidation.

To simplify MMU notifier allocation, pass an allocation context to mmu_notifier_get().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/misc/sgi-gru/grutlbpurge.c |  4 ++--
 include/linux/mmu_notifier.h       | 10 ++++++----
 mm/mmu_notifier.c                  |  6 ++++--
 3 files changed, 12 insertions(+), 8 deletions(-)
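As an illustration of how a caller can use the new argument, a sketch of an alloc_notifier() implementation that receives its context through privdata and therefore publishes a fully initialized object. The io_mm structure and its fields are placeholders borrowed loosely from the SVA discussion, not part of this patch:

    static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm, void *privdata)
    {
            struct io_mm_init *init = privdata;     /* filled in by the bind() caller */
            struct io_mm *io_mm;

            io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL);
            if (!io_mm)
                    return ERR_PTR(-ENOMEM);

            /* Complete before mmu_notifier_get_locked() publishes the notifier */
            io_mm->ops = init->ops;
            io_mm->ctx = init->ctx;
            INIT_LIST_HEAD(&io_mm->devices);
            return &io_mm->notifier;
    }

The bind() path then calls mmu_notifier_get(&io_mm_notifier_ops, mm, &init) and can rely on any io_mm it gets back being complete, whether it was just allocated or found on the mm's notifier list.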